Week 1

Person-centric Data

Data from the people, about the people! Person-centric data covers a broad range of potential forms and sources. It might be survey data with attitudinal variables, purchase history, number of visits to the doctor’s office, number of alcoholic drinks had during the last seven days, internet browsing behavior, or even Tweets. If we start to consider all of the data that we see, it becomes clear that a great deal of data is person-centric.

Not all data fits into this notion of person-centricity. We often find financial data (think end-of-day trading) that really does not fit into our notion of person-centric data. Even though this type of stock-flavored financial data does not really fit, the amount of money a person spends certainly does. Furthermore, much public data exists for municipal entities, such as county assessors. It might seem like a property record is not person-centric, but we could build data that tracks a person’s property buying/selling – we would certainly look at such data and consider it to be behavioral in nature.

Where Do We Go?

What do we do with person-centric, behavioral data? In general, we want to predict or explain behavior in some way! To do that, we need to create some hypotheses. We must, however, be very careful in crafting our hypotheses. While we can do some fancy analyses that start to lead us down a more causal path, we can never fully generalize what makes people do the things that they do.

This will be a theme that is revisited during our time together. Just so that you can start to get your head around hypotheses, let’s take a first try at a little game.

I am noticing that visitors to my site spend less time on one page compared to other pages. Here is what that data might look like:

  visitorID pageID secondsOnPage
1         1      a     16.152030
2         1      b     15.357179
3         1      c     14.385829
4         2      a     21.789992
5         2      b     13.612962
6         3      a     11.147774
7         3      b     14.483975
8         3      c      8.654847
9         3      d     16.865599

I have a visitor identification number, a page identifier, and the number of seconds each user spent on a given page. Do I have data that would allow me to test if there is a difference in the amount of time that people spend on a page? I certainly do! With the data that you see, could I determine exactly why there might be differences in time on page? I would say rather unlikely with the provided data. We always need to make sure that we are not speaking beyond our data!
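Just to make the “testable” part concrete, here is a minimal sketch of that first test. The data frame name pageTime and the rounded values are my reconstruction of the output above; a one-way ANOVA checks for mean differences across pages:

```r
# Reconstruction of the data shown above (values rounded)
pageTime = data.frame(visitorID = c(1, 1, 1, 2, 2, 3, 3, 3, 3), 
                      pageID = c("a", "b", "c", "a", "b", "a", "b", "c", "d"), 
                      secondsOnPage = c(16.15, 15.36, 14.39, 21.79, 13.61, 
                                        11.15, 14.48, 8.65, 16.87))

# One-way ANOVA: does mean time on page differ across pages?
summary(aov(secondsOnPage ~ pageID, data = pageTime))
```

Note that this answers only the “is there a difference” question – nothing in the output will tell us why a page runs short.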

This was just a demo version of the game – you will unlock more difficult levels as we progress.

Analyzing Behavioral Data

On \(R^2\)

Think back to your linear models course. Do you remember examining adjusted \(R^2\)? It is one of the more popular measures of how well a particular model explains the data (adjusted \(R^2\) is noted as how much variance in the dependent variable is explained by the predictor variables in the equation). We are traditionally pushed towards looking for very high values. Students coming from a variety of statistical traditions have been taught a great many “thresholds” for acceptable \(R^2\) values – anywhere from .4 to .7 can be mentioned all in the span of a few seconds. If, however, we are trying to explain what someone does (i.e., a particular behavior), what are the chances that we will obtain such adjusted \(R^2\) values? Probably not very high, right? People are messy and often unpredictable, so achieving huge \(R^2\) values is unlikely.

Does a small \(R^2\) mean that your model is garbage‽ Of course not! A small \(R^2\) does absolutely nothing to take away from the relationships between your predictor variables and your outcome. A low \(R^2\) might mean that you have more error in your prediction, but there are still better ways of figuring that out than just looking at one value and giving up.

This is not to suggest that we should accept an \(R^2\) of .000001 as the most predictive model in the world (note – it probably is not); however, we should not be terribly dejected to see a model with 4 or 5 predictors with an \(R^2\) around .2. That means that we have taken one of nature’s sloppiest and most unpredictable creatures, and explained 20% of the variance of a behavior.

With that aside on \(R^2\) out of the way, we can think more generally about analyzing behavioral data (that is why we are here). The Behavioral Sciences have a rich tradition of interesting analyses: from experimental design to causal models and all points in between. When dealing with behavioral data, we are frequently looking at some type of latent variables; these are constructs we think exist, but have no way to physically touch them. We are going to be dealing in the latent world for the majority of our time together, so start thinking about latent variables now!

Data Wrangling

Person-centric data comes in many different flavors.

There is the classic cross-sectional, wide format (tends towards a vanilla taste):

  id satisfaction leftTip
1  1            1       1
2  2            4       1
3  3            3       1
4  4            4       0
5  5            2       1

We also have repeated measures:

  id observation satisfaction leftTip rating
1  1           1            1       1      3
2  1           2            5       0      7
3  1           3            5       0      6
4  2           1            1       0      2
5  2           2            2       1      3
6  2           3            4       1      5
7  3           1            5       1      7
8  3           2            1       1      3
9  3           3            1       1      3

And key-value pairs:

       variable value
1            id     1
2            id     1
3            id     1
4            id     2
5            id     2
6            id     2
7            id     3
8            id     3
9            id     3
10  observation     1
11  observation     2
12  observation     3
13  observation     1
14  observation     2
15  observation     3
16  observation     1
17  observation     2
18  observation     3
19 satisfaction     1
20 satisfaction     5
21 satisfaction     5
22 satisfaction     1
23 satisfaction     2
24 satisfaction     4
25 satisfaction     5
26 satisfaction     1
27 satisfaction     1
28      leftTip     1
29      leftTip     0
30      leftTip     0
31      leftTip     0
32      leftTip     1
33      leftTip     1
34      leftTip     1
35      leftTip     1
36      leftTip     1
37       rating     3
38       rating     7
39       rating     6
40       rating     2
41       rating     3
42       rating     5
43       rating     7
44       rating     3
45       rating     3

And then from there, we can start to do combinations:

   id     variable value
1   1  observation     1
2   1  observation     2
3   1  observation     3
4   2  observation     1
5   2  observation     2
6   2  observation     3
7   3  observation     1
8   3  observation     2
9   3  observation     3
10  1 satisfaction     1
11  1 satisfaction     5
12  1 satisfaction     5
13  2 satisfaction     1
14  2 satisfaction     2
15  2 satisfaction     4
16  3 satisfaction     5
17  3 satisfaction     1
18  3 satisfaction     1
19  1      leftTip     1
20  1      leftTip     0
21  1      leftTip     0
22  2      leftTip     0
23  2      leftTip     1
24  2      leftTip     1
25  3      leftTip     1
26  3      leftTip     1
27  3      leftTip     1
28  1       rating     3
29  1       rating     7
30  1       rating     6
31  2       rating     2
32  2       rating     3
33  2       rating     5
34  3       rating     7
35  3       rating     3
36  3       rating     3

All of these have their place. Many analyses require traditional wide data. Others, like panel models, require long data (e.g., our repeated measures data).

library(plm)

panelModel = plm(rating ~ satisfaction, index = c("id", "observation"), 
                 data = repeatedMeasures)

summary(panelModel)
Oneway (individual) effect Within Model

Call:
plm(formula = rating ~ satisfaction, data = repeatedMeasures, 
    index = c("id", "observation"))

Balanced Panel: n = 3, T = 3, N = 9

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-0.598291 -0.068376 -0.017094  0.136752  0.401709 

Coefficients:
             Estimate Std. Error t-value  Pr(>|t|)    
satisfaction  0.94872    0.06784  13.985 3.362e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    24
Residual Sum of Squares: 0.59829
R-Squared:      0.97507
Adj. R-Squared: 0.96011
F-statistic: 195.571 on 1 and 5 DF, p-value: 3.3615e-05

Sometimes, we need to reshape our data for the sake of visualization (such visualizations are especially handy for survey questions).

  id satisfaction1 satisfaction2 satisfaction3
1  1             7             3             4
2  2             7             3             7
3  3             5             5             6

If we want to show several people’s survey responses, we need to do some reshaping:

See if you can reproduce the data and visualization! Just as a hint, you will need to make your wide data into long data (melt from reshape2 or gather from tidyr should do the trick).
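As one possible starting point (the data frame name satisfactionWide is my own label for the data shown above), melt from reshape2 will stack the three satisfaction columns into a single long variable that ggplot2 can work with:

```r
library(reshape2)
library(ggplot2)

# Reconstruction of the wide data shown above
satisfactionWide = data.frame(id = 1:3, 
                              satisfaction1 = c(7, 7, 5), 
                              satisfaction2 = c(3, 3, 5), 
                              satisfaction3 = c(4, 7, 6))

# Melt to long: one row per person per item
satisfactionLong = melt(satisfactionWide, id.vars = "id", 
                        variable.name = "item", value.name = "response")

# One line per person across the three items
ggplot(satisfactionLong, aes(item, response, group = id, color = factor(id))) + 
  geom_line() + 
  theme_minimal()
```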

No matter the need, don’t be afraid to shape your data in a way that conforms to your needs. Always remember that you are in control of your data – your data is not in control of you! With various packages from base, the tidyverse, and beyond, you have the power to make your data be whatever you need it to be. While you cannot control the sloppy data that people produce, you can control the data format.

Ethical Considerations

When dealing with person-centric data, we need to always be thinking about ethics. Are we treating the data in a secure manner and honoring people’s privacy? Are we being honest to people about how we are going to handle their data? These are questions that require your care and attention.

This becomes even more important when sharing data. Look at the following data and think about whether or not it should be shared as is:

data.frame(age = sample(18:35, 8, replace = TRUE), 
           gender = sample(c("male", "female"), 8, replace = TRUE), 
           department = sample(c("cs", "hr", "it", "accounting"), 8, replace = TRUE), 
           score = sample(50:100, 8)) 
  age gender department score
1  25 female         it    81
2  21   male accounting    92
3  24   male         cs    77
4  32   male         cs    87
5  19 female accounting    51
6  23 female         it    63
7  33   male         it    78
8  18   male accounting    61

There are not any personal identifiers here, so we are good to put it on Github or Google Drive and share away…right? Let’s pick this apart a little bit. Any of the demographic-inclined variables, in isolation, are not a cause for any concern; when we have all of them together, though, we might be able to identify individuals through “triangulation”. This becomes especially true when there are groups with fewer individuals. Triangulation is a real concern with any type of data, but we need to be diligent in good security and privacy practices when we are dealing with person-centric data. This gets into an area known as de-identification – the removal of not only concrete identifiers (e.g., names, SSNs), but also anything that can be pieced together to identify a person. So, what items would we need to remove from our data to make it de-identified?
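One quick, admittedly crude, check before sharing is to count how many people land in each demographic combination – tiny cells are triangulation risks. A sketch, with the data above saved to a name (shareData) of my choosing:

```r
library(dplyr)

# The data from above, saved to a name
shareData = data.frame(age = c(25, 21, 24, 32, 19, 23, 33, 18), 
                       gender = c("female", "male", "male", "male", 
                                  "female", "female", "male", "male"), 
                       department = c("it", "accounting", "cs", "cs", 
                                      "accounting", "it", "it", "accounting"), 
                       score = c(81, 92, 77, 87, 51, 63, 78, 61))

# Any gender-by-department cell with fewer than 3 people is easy to triangulate
shareData %>% 
  count(gender, department) %>% 
  filter(n < 3)
```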

It might seem far-fetched that anyone would ever go through the trouble of identifying your data, but you should think conspiratorially when it comes to your data – imagine the absolute worst thing that could happen if someone took it and identified it. Could it cause personal problems, job loss, or reveal something embarrassing/illegal?

Remember, the 4Chan collective found a flag in rural Tennessee using contrail patterns, star alignment, and a car horn – in 37 hours. Don’t forget that they also contributed to the identification and subsequent airstrike of an ISIS training facility. Just imagine what could be done with your interesting data.

On Anonymity

It really does not exist – ever. If you are in charge of collecting any type of person-centric data, you can never promise complete anonymity; doing so is truly an amateur move and demonstrates a misunderstanding of behavioral data. To guarantee anonymity is to guarantee that there is no possible way that a person could be identified – this is almost never the case. You may, on the other hand, promise confidentiality. That means that you will take every possible precaution to ensure that a person’s data is not broadcast or imprudently shared.

Week 2

Data Types And Structures

We see many different types of data when working with behavioral data, well beyond what we typically think about with our different levels of measurement (nominal, ordinal, interval, and ratio).

Attitudinal Data

Survey data is where we typically find attitudinal data, and it tends to present some interesting issues. How are the items measured – are they traditional Likert response options (strongly disagree to strongly agree), or are they some type of feelings thermometer (typically some crazy range that could be anywhere between 0 and 100, or -100 and 100)? Perhaps it is a Likert-type scale (Not at all to Very much). How many response options are there – 3, 5, 7, or more?

When dealing with such attitudinal data, we might see data that takes any of the following forms:

library(dplyr)
library(magrittr)
data.frame(agreementNumeric = 1:5, 
           agreementOrdered = ordered(1:5, labels = c("Strongly disagree", 
                                                      "Disagree", "Neither disagree nor agree", 
                                                      "Agree", "Strongly agree")), 
           agreementCharacter = c("Strongly disagree", 
                                  "Disagree", "Neither disagree nor agree", 
                                  "Agree", "Strongly agree"), 
           pseudoLikert = c("Not at all", "A little", "Some", "A lot", "Very much")) %T>%
  glimpse() %>% 
  summary()
Observations: 5
Variables: 4
$ agreementNumeric   <int> 1, 2, 3, 4, 5
$ agreementOrdered   <ord> Strongly disagree, Disagree, Neither disagr...
$ agreementCharacter <fct> Strongly disagree, Disagree, Neither disagr...
$ pseudoLikert       <fct> Not at all, A little, Some, A lot, Very much
 agreementNumeric                   agreementOrdered
 Min.   :1        Strongly disagree         :1      
 1st Qu.:2        Disagree                  :1      
 Median :3        Neither disagree nor agree:1      
 Mean   :3        Agree                     :1      
 3rd Qu.:4        Strongly agree            :1      
 Max.   :5                                          
                  agreementCharacter     pseudoLikert
 Agree                     :1        A little  :1    
 Disagree                  :1        A lot     :1    
 Neither disagree nor agree:1        Not at all:1    
 Strongly agree            :1        Some      :1    
 Strongly disagree         :1        Very much :1    
                                                     

This is where data familiarization and exploration becomes important – always take note of variable types. If you are receiving this data from another source, you never know what you might be getting.

With the data as it is in its current form, think about what you might like to do with it. Would you like to plot it or create a correlation matrix? Can you make that happen? Data processing becomes very important when working through any survey-based data.

What functions would you use to convert everything to numeric?
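One possible answer: ordered factors go straight through as.numeric(), but character vectors need their level order set explicitly first (alphabetical order would scramble a Likert scale). A small sketch with made-up responses:

```r
likertLevels = c("Strongly disagree", "Disagree", "Neither disagree nor agree", 
                 "Agree", "Strongly agree")

responses = c("Agree", "Strongly disagree", "Neither disagree nor agree")

# factor() fixes the level order; as.numeric() then returns the level positions
as.numeric(factor(responses, levels = likertLevels))
# 4 1 3
```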

Censored Data

We might run into data that is censored. While it might sound like something exciting at first, censored variables are simply those for which we do not know the values beyond some bound (often because data collection started or ended). You may run into variables that have been discretized – a continuous variable, such as age, broken down into discrete categories. When we get that age variable, we might see categories like “<25”, “25-35”, “36-45”, “46-55”, and “55+”. When people have an age of “<25”, we have left censored data – we don’t know how far to the “left” the value actually is; we just know that it is somewhere under 25. For our observations with “55+”, we have right censored data – we don’t know how far to the right the value actually is.

  ageCategory       censored
1         <25  Left censored
2       25-35     Uncensored
3       36-45     Uncensored
4       46-55     Uncensored
5         55+ Right censored

While we won’t be diving into them, tobit models are one such way for dealing with censored dependent variables.

The following is similar to an example from the AER package. The max number of reported affairs is 12, but that was at the time of reporting – the philandering partners could have engaged in more affairs after data was collected. To that end, we need to account for right censoring.

library(AER)

data("Affairs")

fm.tobit = tobit(affairs ~ age + yearsmarried + religiousness,
                   right = 12, data = Affairs)

summary(fm.tobit)

Call:
tobit(formula = affairs ~ age + yearsmarried + religiousness, 
    right = 12, data = Affairs)

Observations:
         Total  Left-censored     Uncensored Right-censored 
           601            451            112             38 

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.30266    3.04944  -0.099   0.9209    
age           -0.23709    0.11150  -2.126   0.0335 *  
yearsmarried   0.92957    0.19594   4.744 2.09e-06 ***
religiousness -2.59303    0.58544  -4.429 9.46e-06 ***
Log(scale)     2.45733    0.08253  29.777  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Scale: 11.67 

Gaussian distribution
Number of Newton-Raphson Iterations: 4 
Log-likelihood: -661.8 on 5 Df
Wald-statistic: 36.75 on 3 Df, p-value: 5.1855e-08 

Special Note On Discretization

If you find yourself on the planning end of some broader data collection efforts, absolutely resist the temptation (internal or external) to discretize data. Some people will say, “But response rates…blah…blah…blah”, or, “We have to make it easy”, or, “Think of the children”. Consider the following data:

    age
1   <25
2 25-35
3 36-45
4 36-45
5 46-55
6   55+

You have 6 respondents – what is the mean age of those respondents? I have no idea and neither do you! If you allow data to be collected this way, you are immediately imposing limits on what you might do with that data in the future. Someone will likely say, “Just take the middle point of the range and call that the response.”

  age
1  25
2  30
3  40
4  40
5  50
6  55
  mean(age)
1        40

Great! We have an average now. What if the actual ages were as follows:

  age
1  21
2  30
3  36
4  45
5  52
6  74
  mean(age)
1        43

What is your mean age now? Is it still 40? Will it matter for your analyses? You need to decide if the difference in the answer is important or not. Maybe you decide it is not that different and are comfortable doing this. Let me pose an additional question: is a 25-year-old the same as a 35-year-old? Just think about that for your own self: is 18-year-old you the same as 24-year-old you? This becomes an even bigger issue when we have ranges with unequal intervals (commonly seen in income brackets that might start with a few thousand dollars in range, but end with upwards of 100K dollars or more in range).
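To see how fragile the midpoint trick is, here is the arithmetic from above as a quick sketch (the midpoint mapping is the same made-up one – note that the “<25” and “55+” bins have no real midpoint at all):

```r
ageCategory = c("<25", "25-35", "36-45", "36-45", "46-55", "55+")

# Assumed midpoints for each bin, as in the recoded table above
midpoints = c("<25" = 25, "25-35" = 30, "36-45" = 40, "46-55" = 50, "55+" = 55)

# Mean under the midpoint assumption
mean(midpoints[ageCategory])
# 40

# Mean of the actual (normally unknown) ages
mean(c(21, 30, 36, 45, 52, 74))
# 43
```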

And here is another common issue: someone will want to discretize a variable and use it as an outcome variable in a model. Such shenanigans are common with discretized income variables. You have the power to make that happen (think back to our friend ordered logistic regression from generalized linear models). The person who was expecting a standard linear regression, though, is going to be really confused when you show them more than one intercept!

Remember, it is okay to make things easy on people, but give them some credit. Most people know how old they are – if they won’t tell you their exact age, they probably won’t be inclined to tell you the range either.

Hierarchical Data

Also known as multilevel or nested data, hierarchical data is best defined by a classic example – we could have students, nested within classrooms, nested within schools, nested within districts, and so on.

Here is what some nested data might look like:

   id subgroup group
1   1        1     1
2   2        1     1
3   3        1     1
4   4        2     1
5   5        2     1
6   6        2     2
7   7        3     2
8   8        3     2
9   9        3     2
10 10        4     2
11 11        4     3
12 12        4     3
13 13        5     3
14 14        5     3
15 15        5     3

We have an id variable, which is giving us an individual identifier. Each person is in 1 of 5 subgroups and each subgroup is in 1 of 3 groups.

Depending on your needs, this structure can be utilized in your analyses (we are going to be getting into it within the next few weeks).
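As a small preview (a sketch, not the full treatment – the outcome variable here is made up), packages like lme4 let the nesting enter the model through random intercepts:

```r
library(lme4)

# The nested structure from above, plus a hypothetical score outcome
nestedData = data.frame(id = 1:15, 
                        subgroup = rep(1:5, each = 3), 
                        group = rep(1:3, each = 5), 
                        score = rnorm(15, mean = 50, sd = 10))

# Random intercepts for group, and for subgroup within group
nestedModel = lmer(score ~ 1 + (1 | group/subgroup), data = nestedData)

summary(nestedModel)
```

With this few observations and random data, the fit will likely be singular – it is only here to show the formula syntax.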

Repeated Measures

Repeated measures can be thought of in the same way as longitudinal data – we are measuring the same thing for the same person over an extended period of time. If you want, you can even think of it as nested data; the time-based observations are nested within an individual.

Here is how that data will typically look:

data.frame(id = rep(1:3, each = 3), 
           observation = rep(1:3, times = 3), 
           responseTimeSeconds = sample(1:5, 9, replace = TRUE))
  id observation responseTimeSeconds
1  1           1                   1
2  1           2                   2
3  1           3                   5
4  2           1                   5
5  2           2                   1
6  2           3                   4
7  3           1                   3
8  3           2                   4
9  3           3                   1

This data shows us 3 different people, each with 3 different observations.

Text

Welcome to the now! With people generating volume after volume of text, we have many questions that we can explore.

We can also have a bit of exploratory fun!

You can likely imagine the source, which gets back to our conversation last week about ethics and confidentiality. A great deal of data is public and out there for the taking, but not all of it is.

Sources Of Data

When we are dealing in behavioral data, there are many places that we can typically find it. Many are obvious, like surveys. Others, however, might require a little bit more imagination.

Surveys

Self-reported surveys might be the classic source for behavioral data. With the rise of web-based surveys, collecting data has never been easier. Although response rates have been declining over the years, it is still easy to amass a reasonably-sized sample (especially if you can pay participants). In addition to collecting your own data, there are numerous big surveys that make the data available to the public (e.g., the Behavioral Risk Factor Surveillance System and the National Survey on Drug Use and Health).

Client-side Paradata/Metadata

Surveys provide data that people explicitly offer, but what about the behaviors that go into survey responding (or any behaviors that lead up to an action)? Paradata and metadata offer such data. While both relate to the “hidden” data that people produce, they are a bit different. Metadata is all of the little stuff that happens when people use a site – login times, time on page and navigation, and other data related to use. Client-side paradata (CSP), once thought of mostly in the context of person-to-person surveys, details the small things that people do. Mouse movements, clicks, scrolling, and related variables can all be collected and analyzed.

Records

Records tend to be a bit tricky, but they are out there (in fact, a great many organizations make money hand over fist by pulling public records together). While you might be granted the keys to the kingdom and given access to such databases, you can pull public records from many different places: county assessors’ websites and county courthouse websites are two such examples (jailtracker is fun to use). Many cities across America are also diving into the open data game and providing a great wealth of data. While it might not all be useful, there is certainly enough there to begin thinking about interesting questions.

Social Media

Social media is ripe for data (being mindful of Terms of Service, of course). Post history – both what was posted and how frequently – makes for excellent data points. Social networks are also a popular research endeavor.

Week 3

Data Exploration

Although we have identified our hypotheses and gotten our data into shape, that does not mean that we should just go right into hypothesis testing.

We can use the power of R to explore our data in many different ways. Perhaps you would like to dive right in and examine the relationships within your data. You might be tempted to construct a correlation matrix:

corDat = data.frame(item1 = sample(1:7, 50, replace = TRUE), 
           item2 = sample(1:7, 50, replace = TRUE), 
           item3 = sample(1:7, 50, replace = TRUE), 
           item4 = sample(1:7, 50, replace = TRUE), 
           item5 = sample(1:7, 50, replace = TRUE))

cor(corDat)
            item1        item2       item3        item4       item5
item1  1.00000000 -0.133509603 -0.15432422 -0.097453039 -0.48378554
item2 -0.13350960  1.000000000  0.13222542 -0.006078968 -0.12516972
item3 -0.15432422  0.132225415  1.00000000 -0.325306407  0.07113753
item4 -0.09745304 -0.006078968 -0.32530641  1.000000000  0.11022161
item5 -0.48378554 -0.125169722  0.07113753  0.110221608  1.00000000

Quickly, where are the strongest correlations? It probably took a few seconds to process the whole correlation matrix; now imagine doing that with many variables.

This is where corrplot (or corrgrams, to some) comes in handy.

corrplot::corrplot(cor(corDat))

This gives us a much quicker way to find the strong relationships amongst our variables. It can also give us a good idea about how items might shake out into components:

set.seed(5)

df = data.frame(item1 = sample(1:7, 10, replace = TRUE), 
           item2 = sample(1:7, 10, replace = TRUE), 
           item3 = sample(1:7, 10, replace = TRUE), 
           item4 = sample(1:7, 10, replace = TRUE), 
           item5 = sample(1:7, 10, replace = TRUE), 
           item6 = sample(1:7, 10, replace = TRUE), 
           item7 = sample(1:7, 10, replace = TRUE), 
           item8 = sample(1:7, 10, replace = TRUE), 
           item9 = sample(1:7, 10, replace = TRUE), 
           item10 = sample(1:7, 10, replace = TRUE), 
           item11 = sample(1:7, 10, replace = TRUE), 
           item12 = sample(1:7, 10, replace = TRUE), 
           item13 = sample(1:7, 10, replace = TRUE), 
           item14 = sample(1:7, 10, replace = TRUE), 
           item15 = sample(1:7, 10, replace = TRUE))
cor(df) %>% 
  corrplot::corrplot(., order = "FPC")

Knock Knock

Who’s There?

Orange

Orange Who?

Orange you glad that you did not need to look through a 15 by 15 correlation matrix to find the strongest relationships?

While it might take you a bit to appreciate statistical humor, you will immediately see benefits to such plots.

You can even make them more art than utility:

library(ggplot2)

df %>% 
  cor() %>% 
  reshape2::melt(.) %>% 
  ggplot(., aes(Var1, Var2, fill = value)) +
  geom_raster(interpolate = TRUE) +
  scale_fill_gradient(low = "#00ffd5", high = "#ff5500") +
  theme_minimal()

Not only can you quickly explore your data for hotspots, you can also create desktop backgrounds.

Heatmaps can be used in conjunction with distance measures to tell how similar observations might be. Instead of looking at a massive distance matrix, we can compute the distance matrix, and pass it along to our visualization to see how similar (or not) these 50 people are to each other (this gets into clustering, which we will be seeing later in the course).
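A sketch of that workflow (the data here is a randomly generated stand-in for 50 people):

```r
set.seed(1001)

# Hypothetical data: 50 people answering 5 items
people = matrix(sample(1:7, 250, replace = TRUE), nrow = 50)

# Euclidean distance between every pair of people
distances = dist(people)

# heatmap() reorders rows/columns so similar people end up next to each other
heatmap(as.matrix(distances))
```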

Apophenia

Look at this visualization:

You see the relationship, don’t you?

Hopefully, you aren’t seeing much of anything – these are just two random normal distributions plotted against each other. If you did start to see some pattern, don’t worry – this just means that you are indeed a human being. We have a natural flaw that makes us see patterns when they are not really there; this is why people often see Jesus in chips and on their toast.

Let’s try this out.

Do any of those visualizations pop out to you? In other words, is there one that is absolutely not like the others? Not really. However, one of those visualizations contains the actual data – the other three have random data for one of the variables. Hopefully, you didn’t trick yourself into seeing anything terribly strong.

Let’s try it with something that might have slightly stronger relationships:

Now, can you spot the real relationship in this one? Probably!

Let’s bump the distractors now:

How’s that? Is it easy to spot the one that is markedly different?

There is a point to all of this – don’t get fooled by false patterns. Just because we are studying human behavior and the data it produces does not mean we are immune from the trappings of being human.

Factor Analysis

Measurement & Latent Variables

If I asked you your height, could you tell me? How do you know your height and what allowed you to know your height? Like most, you have probably been measured at the doctor’s office with some device marked with inches.

Now, if I were to ask you how you felt, what would you tell me? You might say that you are feeling well or not feeling too well, but how did you measure this? Did you use your wellness measuring tape or your wellness balance beam? Unlikely. Instead, you likely thought about everything going on – your current health, general mood, and whatever else you deemed important at the time. Put another way, I could essentially be asking you about your affect (mood). Unfortunately, we can’t measure your affect like we can measure your height. There is no agreed upon method for measuring your current affect (there isn’t even an argument between Imperial and metric to be had).

Affect is an example of a latent variable – we have a pretty good idea that it exists and we have operationalized it, but there is no way that we can physically measure it. If we think of what latent tends to mean, hidden, we start to understand what latent variables are – variables that are hidden, but still interesting. Instead, we have to rely on a series of questions that gets us close to what we think is affect. This process of “getting at” affect through questions is what is known as measurement. You have been involved with measurement for a very long time, even if you did not know it. If you ever took a high-stakes standardized test (e.g., GRE, GMAT, LSAT), then you were directly involved in measurement – we can’t actually measure the mass of your mathematical reasoning ability, so we have to ask questions to measure that ability.

An important part of measurement concerns error. Error can come from many different sources (measure, person, environment, etc.) and “distorts” the observed score.

On Principal Components Analysis And Factor Analysis

There might be a feeling of déjà vu creeping in here. You have already learned about a technique used for taking items and reducing them down – principal components analysis (PCA). The logical question is: are factor analysis and PCA the same thing? They are both matrix factorization techniques, but the similarities start to end pretty quickly after that.

  • Causal direction: In PCA, the items are the cause of the component (think how ingredients combine to make a pancake – the pancake exists because the ingredients exist). In factor analysis, the latent factor gives rise to the items (i.e., the items reflect the construct – questions for mathematical reason exist because there is a construct for mathematical reasoning that exists beyond the existence of the items).

  • Component/Factor Interpretation: There is no interpretation of a component – it simply exists to reduce the variables. Factors, on the other hand, represent the latent variables and are themselves of interest.

  • Variance: PCA attempts to extract as much variance as possible from the items. Each successive component after the first will extract less variance than the one before it.

  • Measurement error: PCA assumes perfect measurement; factor analysis explicitly allows for measurement error.

The psych package will make life easy for us. Let’s see what these differences mean in terms of our output:

library(dplyr)

library(psych)

testData = bfi %>% 
  select(starts_with("A", ignore.case = FALSE), 
         starts_with("C"))

testPCA = principal(r = testData, nfactors = 2, rotate = "none")

testFA = fa(r = testData, nfactors = 2, rotate = "none")

testPCA
Principal Components Analysis
Call: principal(r = testData, nfactors = 2, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
     PC1   PC2   h2   u2 com
A1 -0.32 -0.45 0.31 0.69 1.8
A2  0.59  0.48 0.59 0.41 1.9
A3  0.61  0.51 0.63 0.37 1.9
A4  0.56  0.26 0.38 0.62 1.4
A5  0.56  0.43 0.51 0.49 1.9
C1  0.49 -0.47 0.46 0.54 2.0
C2  0.58 -0.44 0.52 0.48 1.9
C3  0.55 -0.37 0.44 0.56 1.8
C4 -0.60  0.41 0.53 0.47 1.8
C5 -0.58  0.34 0.46 0.54 1.6

                       PC1  PC2
SS loadings           3.02 1.79
Proportion Var        0.30 0.18
Cumulative Var        0.30 0.48
Proportion Explained  0.63 0.37
Cumulative Proportion 0.63 1.00

Mean item complexity =  1.8
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.09 
 with the empirical chi square  2262.93  with prob <  0 

Fit based upon off diagonal values = 0.86
testFA
Factor Analysis using method =  minres
Call: fa(r = testData, nfactors = 2, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
     MR1   MR2   h2   u2 com
A1 -0.26 -0.29 0.15 0.85 2.0
A2  0.55  0.40 0.46 0.54 1.8
A3  0.59  0.47 0.56 0.44 1.9
A4  0.47  0.18 0.25 0.75 1.3
A5  0.51  0.34 0.37 0.63 1.7
C1  0.42 -0.36 0.31 0.69 2.0
C2  0.51 -0.37 0.40 0.60 1.8
C3  0.47 -0.29 0.31 0.69 1.7
C4 -0.53  0.37 0.42 0.58 1.8
C5 -0.51  0.29 0.34 0.66 1.6

                       MR1  MR2
SS loadings           2.40 1.18
Proportion Var        0.24 0.12
Cumulative Var        0.24 0.36
Proportion Explained  0.67 0.33
Cumulative Proportion 0.67 1.00

Mean item complexity =  1.8
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  45  and the objective function was  2.03 with Chi Square of  5664.89
The degrees of freedom for the model are 26  and the objective function was  0.17 

The root mean square of the residuals (RMSR) is  0.04 
The df corrected root mean square of the residuals is  0.05 

The harmonic number of observations is  2762 with the empirical chi square  397.07  with prob <  5e-68 
The total number of observations was  2800  with Likelihood Chi Square =  468.37  with prob <  1.2e-82 

Tucker Lewis Index of factoring reliability =  0.864
RMSEA index =  0.078  and the 90 % confidence intervals are  0.072 0.084
BIC =  261.99
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy             
                                                  MR1  MR2
Correlation of (regression) scores with factors   0.9 0.82
Multiple R square of scores with factors          0.8 0.67
Minimum correlation of possible factor scores     0.6 0.34

While items might load on the same factor/component, the magnitudes are quite different. We also see differences in communality (h2 – the proportion of an item’s variance explained by the retained factors/components) and item uniqueness (u2 – the variance that is unique to the item and not the factor/component; u2 = 1 − h2). We can also see that our PCA extracted more variance from the items than did our EFA (.48 to .36).
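As a quick sanity check on what h2 and u2 mean, we can rebuild them by hand from the loadings. Using item A1’s rounded loadings on PC1 and PC2 from the PCA output above (-0.32 and -0.45):

```r
# Communality (h2) is the sum of an item's squared loadings across the
# retained components; uniqueness (u2) is the variance left over.
loadA1 = c(-0.32, -0.45)

h2_A1 = sum(loadA1^2)
u2_A1 = 1 - h2_A1

round(c(h2 = h2_A1, u2 = u2_A1), 2)
```

The result sits right at the printed values of .31 and .69; any tiny difference comes from starting with rounded loadings.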

Determining Factor Numbers

Determining the appropriate number of factors in an exploratory factor analysis can be complex. How many might theory dictate? Is there a strong theory at all? In an exploratory setting, it is helpful to conduct something called a parallel analysis (PA). PA is a Monte Carlo simulation: it generates random data with the same number of items and rows as your data, and then compares the eigenvalues from your data to those from the random data – factors are retained only while they outperform random noise.

psych::fa.parallel(testData)

Parallel analysis suggests that the number of factors =  4  and the number of components =  2 

This is the most automated way of finding a potentially suitable number of factors. We are given some output appropriate for both factor analysis and principal components analysis, with a graphical representation provided. In this case, parallel analysis would suggest 4 factors to be retained for a factor analysis. If some theory exists, you can also use that information to guide your factor numbers.
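If you want to see the core of parallel analysis without the psych wrapper, a rough sketch in base R looks like this (a real PA repeats the simulation many times and compares the result against your observed eigenvalues; the dimensions here just echo the data above):

```r
# Parallel analysis by hand: eigenvalues from pure-noise data of the
# same shape as ours give a baseline that real factors must beat.
set.seed(1001)

nObs = 2800   # rows, roughly matching the bfi-based data above
nItems = 10   # columns

randomData = matrix(rnorm(nObs * nItems), nrow = nObs)
randomEigen = eigen(cor(randomData))$values

# With no real structure, every eigenvalue hovers around 1.
round(randomEigen, 2)
```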

You can also use the nfactors function, though its various criteria will often give somewhat different results:

testData %>% 
  na.omit() %>% 
  cor() %>% 
  psych::nfactors()


Number of factors
Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
    n.obs = n.obs, plot = FALSE, title = title, use = use, cor = cor)
VSS complexity 1 achieves a maximimum of 0.67  with  2  factors
VSS complexity 2 achieves a maximimum of 0.78  with  5  factors
The Velicer MAP achieves a minimum of 0.03  with  2  factors 
Empirical BIC achieves a minimum of  -53.47  with  3  factors
Sample Size adjusted BIC achieves a minimum of  -8.49  with  4  factors

Statistics by number of factors 
   vss1 vss2   map dof   chisq     prob sqresid  fit RMSEA BIC SABIC
1  0.54 0.00 0.046  35 8.4e+02 1.7e-154     7.4 0.54 0.152 602 712.9
2  0.67 0.73 0.034  26 1.6e+02  7.4e-22     4.3 0.73 0.073 -16  66.3
3  0.57 0.74 0.057  18 9.0e+01  1.4e-11     3.8 0.76 0.064 -34  23.0
4  0.62 0.77 0.083  11 3.3e+01  6.2e-04     3.1 0.81 0.045 -43  -8.5
5  0.61 0.78 0.118   5 1.3e+01  2.6e-02     2.5 0.85 0.040 -22  -6.0
6  0.55 0.76 0.185   0 2.0e-01       NA     2.3 0.86    NA  NA    NA
7  0.49 0.70 0.287  -4 2.6e-08       NA     2.5 0.84    NA  NA    NA
8  0.46 0.71 0.453  -7 1.5e-08       NA     2.5 0.84    NA  NA    NA
9  0.48 0.72 1.000  -9 0.0e+00       NA     2.5 0.84    NA  NA    NA
10 0.48 0.72    NA -10 0.0e+00       NA     2.5 0.84    NA  NA    NA
   complex  eChisq    SRMR eCRMS eBIC
1      1.0 1.4e+03 1.2e-01 0.141 1154
2      1.1 1.4e+02 3.9e-02 0.051  -42
3      1.4 7.1e+01 2.8e-02 0.044  -53
4      1.3 2.4e+01 1.6e-02 0.033  -52
5      1.4 8.5e+00 9.7e-03 0.029  -26
6      1.6 1.2e-01 1.2e-03    NA   NA
7      1.9 1.5e-08 4.1e-07    NA   NA
8      1.9 8.6e-09 3.1e-07    NA   NA
9      1.9 3.2e-17 1.9e-11    NA   NA
10     1.9 3.2e-17 1.9e-11    NA   NA

Now we are provided a variety of different metrics for determining the number of factors to retain. The suggested numbers range from 2 (very simple structure with complexity 1, and the Velicer MAP) to 5 (very simple structure with complexity 2). Given the results of our parallel analysis and the sample size adjusted BIC, we could make a good argument for retaining 4 factors.

Rotation

In some previous courses, you learned about PCA and eigenvalues/eigenvectors. You might remember how we rotate axes to try to get our components to fit together better. Generally, PCA forces an orthogonal rotation – in other words, the axes always maintain 90-degree angles. Factor analysis allows for orthogonal rotations, but also oblique rotations (the angle between axes can go above or below 90 degrees). While pictures certainly help, there is something very important at play here. As you remember from PCA, an orthogonal rotation holds that the factors (or components) are not correlated. For components, this makes sense. For our scale, however, would you guess that the items being analyzed are not correlated? By using an oblique rotation, we are allowing the factors to be as correlated as necessary.

Our rotations, though not terribly different, will produce different loading patterns.

orthRotation = fa(r = testData, nfactors = 2, rotate = "varimax")

obliqueRotation = fa(r = testData, nfactors = 2, rotate = "promax")

orthRotation
Factor Analysis using method =  minres
Call: fa(r = testData, nfactors = 2, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
     MR1   MR2   h2   u2 com
A1  0.01 -0.39 0.15 0.85 1.0
A2  0.12  0.67 0.46 0.54 1.1
A3  0.10  0.74 0.56 0.44 1.0
A4  0.22  0.45 0.25 0.75 1.4
A5  0.13  0.60 0.37 0.63 1.1
C1  0.55  0.03 0.31 0.69 1.0
C2  0.63  0.08 0.40 0.60 1.0
C3  0.54  0.11 0.31 0.69 1.1
C4 -0.64 -0.10 0.42 0.58 1.1
C5 -0.57 -0.14 0.34 0.66 1.1

                       MR1  MR2
SS loadings           1.81 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.51 0.49
Cumulative Proportion 0.51 1.00

Mean item complexity =  1.1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  45  and the objective function was  2.03 with Chi Square of  5664.89
The degrees of freedom for the model are 26  and the objective function was  0.17 

The root mean square of the residuals (RMSR) is  0.04 
The df corrected root mean square of the residuals is  0.05 

The harmonic number of observations is  2762 with the empirical chi square  397.07  with prob <  5e-68 
The total number of observations was  2800  with Likelihood Chi Square =  468.37  with prob <  1.2e-82 

Tucker Lewis Index of factoring reliability =  0.864
RMSEA index =  0.078  and the 90 % confidence intervals are  0.072 0.084
BIC =  261.99
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy             
                                                   MR1  MR2
Correlation of (regression) scores with factors   0.85 0.86
Multiple R square of scores with factors          0.72 0.75
Minimum correlation of possible factor scores     0.45 0.49
obliqueRotation
Factor Analysis using method =  minres
Call: fa(r = testData, nfactors = 2, rotate = "promax")
Standardized loadings (pattern matrix) based upon correlation matrix
     MR1   MR2   h2   u2 com
A1  0.08 -0.41 0.15 0.85 1.1
A2  0.00  0.68 0.46 0.54 1.0
A3 -0.03  0.76 0.56 0.44 1.0
A4  0.14  0.44 0.25 0.75 1.2
A5  0.03  0.60 0.37 0.63 1.0
C1  0.57 -0.07 0.31 0.69 1.0
C2  0.64 -0.02 0.40 0.60 1.0
C3  0.54  0.02 0.31 0.69 1.0
C4 -0.65  0.01 0.42 0.58 1.0
C5 -0.56 -0.05 0.34 0.66 1.0

                       MR1  MR2
SS loadings           1.81 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.51 0.49
Cumulative Proportion 0.51 1.00

 With factor correlations of 
     MR1  MR2
MR1 1.00 0.34
MR2 0.34 1.00

Mean item complexity =  1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  45  and the objective function was  2.03 with Chi Square of  5664.89
The degrees of freedom for the model are 26  and the objective function was  0.17 

The root mean square of the residuals (RMSR) is  0.04 
The df corrected root mean square of the residuals is  0.05 

The harmonic number of observations is  2762 with the empirical chi square  397.07  with prob <  5e-68 
The total number of observations was  2800  with Likelihood Chi Square =  468.37  with prob <  1.2e-82 

Tucker Lewis Index of factoring reliability =  0.864
RMSEA index =  0.078  and the 90 % confidence intervals are  0.072 0.084
BIC =  261.99
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy             
                                                   MR1  MR2
Correlation of (regression) scores with factors   0.86 0.88
Multiple R square of scores with factors          0.75 0.77
Minimum correlation of possible factor scores     0.49 0.54

We can see that our oblique rotation produces a correlation of .34 between the two factors. I will leave it up to you to determine whether that correlation is strong enough to warrant an oblique rotation. An orthogonal rotation, while limiting the correlation between factors, does produce “cleaner” results. If the factors are not correlated, we can interpret each one in isolation. In an oblique rotation with correlated factors, we have to interpret the factors simultaneously and consider how items loading on multiple factors behave.

Factor Loadings

We can think of factor loadings as the correlation between an item and a factor – they are interpreted in this manner (strictly speaking, this holds for orthogonal solutions; in an oblique solution, the pattern loadings behave more like regression weights, but the reading is similar).

Let’s take a peek at the factor loadings of our oblique solution from above:

obliqueRotation$loadings

Loadings:
   MR1    MR2   
A1        -0.407
A2         0.678
A3         0.761
A4  0.144  0.437
A5         0.603
C1  0.574       
C2  0.640       
C3  0.544       
C4 -0.652       
C5 -0.564       

                 MR1   MR2
SS loadings    1.806 1.768
Proportion Var 0.181 0.177
Cumulative Var 0.181 0.357

We can see that our output hides loadings of small magnitude (refer to the previous results for all values) – this makes the loading matrix easier to scan, and we can treat the hidden loadings as negligible given their weakness. We can read these very much in the same way that we read correlations. For factor 1 (MR1 in the output), items C1 through C5 load pretty strongly (with A4 having a weak loading). So, someone who offers a high response on C5 likely has a lower value of whatever latent trait is being measured (note the negative loading), while a higher response on question C2 suggests a higher value of the latent trait.

Factor Scoring

We have gone through the necessary steps of performing our factor analysis to this point, but what do we ultimately get out of it? In the end, we want to take our factors and produce some type of score.

What kind of score should we produce? If we use the numeric values given by the observed variables (e.g., 1-5, 0-4), then we can imagine producing some type of aggregate score (i.e., a sum or an average score). If you ever took an undergraduate Psychology course, you have probably already done something like this:

testData = testData %>% 
  rowwise() %>% 
  mutate(agreeAverage = (sum(A1, A2, A3, A4, A5, na.rm = TRUE) / 5), 
         consAverage = (sum(C1, C2, C3, C4, C5, na.rm = TRUE) / 5))

Do those aggregate scores tell the complete truth? Remember how we just talked about loadings? The factor loadings are essentially telling us how much the variable is related to the factor. If we use a simple aggregate score, is that relationship captured? I am afraid not; aggregate scores are going to give the same weight to every item, regardless of how highly it loads on any given factor. What about cross-loadings? This type of aggregate scoring would completely ignore them!

Instead, we can use factor scores. There are several different types of factor scores that we can compute, but we are going to compute Thurstone scores.

Thurstone scores are incredibly easy to produce by hand, so let’s give it a try.

To compute those scores, we need to produce a correlation matrix of our observed scores. For simplicity, let’s see what a one-factor model would look like.

agreeDat = bfi %>% 
  select(starts_with("A", ignore.case = FALSE))

agreeCorrs = cor(agreeDat, use = "pairwise")

Then, we need to get the loadings from our factor analysis results:

agreeFA = fa(agreeCorrs, rotate = "promax", scores = "regression")

agreeLoadings = agreeFA$loadings[1:5, 1]

Now, we have everything that we need: item loadings, item correlations, and observed scores.
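In matrix terms, the two steps that follow are (with \(R\) the item correlation matrix, \(\Lambda\) the loading matrix, and \(Z\) the standardized data):

\[
w = R^{-1}\Lambda, \qquad \hat{F} = Zw
\]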

The first step is to get the matrix product between the inverse correlation matrix and the factor loading matrix:

w = solve(agreeCorrs, agreeLoadings)

Finally, we center and scale the observed data, and matrix-multiply it by the weights w that we just found:

agreeScaled = agreeDat %>% 
  scale(., center = TRUE, scale = TRUE)

facScores = agreeScaled %*% w

head(facScores)
            [,1]
61617 -0.8598383
61618 -0.2435214
61620 -0.4336288
61621  0.2284224
61622 -0.9461108
61623  0.3984620

We can check these against those returned from our results:

head(predict(agreeFA, agreeDat))
             MR1
61617 -0.8598383
61618 -0.2435214
61620 -0.4336288
61621  0.2284224
61622 -0.9461108
61623  0.3984620

Now, let’s see how well those scores are correlated with simple aggregate scores.

simpleScores = agreeDat %>% 
  mutate(simpleScore = rowMeans(.))
cor(facScores[, 1], simpleScores$simpleScore, use = "complete")
[1] 0.8537173

We can see that the scores are certainly correlated, but the factor scores are likely closer to true scores.

Reliability

Factor analysis falls under a broader notion of what is called Classical Test Theory (CTT). In CTT, the notion of reliability is of supreme importance. Reliability is most easily defined as a measure’s ability to produce the same result consistently. If I give you the same measure a few months apart and you score similarly both times, then we are likely dealing with a reliable test. It is important to note that reliable is not the same as valid. Validity, with regard to measurement, is knowing that our measure is measuring what we intend it to measure (e.g., our measure of likelihood to engage in whistleblowing is actually measuring a person’s likelihood to blow the whistle). A measure can be reliable without being valid – remember, it just has to produce the same results time after time.
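A tiny base R simulation makes the test-retest idea concrete (the names and numbers are mine, purely for illustration): two noisy measurements of the same underlying trait should correlate highly if the measure is reliable.

```r
# Simulated test-retest reliability: measure the same people twice.
set.seed(1001)

trueScore = rnorm(500)                   # stable underlying trait
time1 = trueScore + rnorm(500, sd = 0.5) # first administration
time2 = trueScore + rnorm(500, sd = 0.5) # second administration

# The test-retest reliability is just this correlation.
cor(time1, time2)
```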

Assessing reliability is generally an easy task. The simplest version of reliability is Cronbach’s measure of internal consistency, or \(\alpha\). Cronbach’s \(\alpha\) is equivalent to the average of all possible split-half correlations within a measure (internal consistency means that the whole measure is reliable within itself).
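For a measure with \(k\) items, \(\alpha\) can be written as:

\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^2_{i}}{\sigma^2_{X}}\right)
\]

where \(\sigma^2_{i}\) is the variance of item \(i\) and \(\sigma^2_{X}\) is the variance of the total score – this is exactly what we will compute by hand a bit later (the trace of the covariance matrix over its total sum).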

Getting \(\alpha\) is easy enough:

psych::alpha(agreeDat, check.keys = TRUE)

Reliability analysis   
Call: psych::alpha(x = agreeDat, check.keys = TRUE)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd median_r
       0.7      0.71    0.68      0.33 2.5 0.009  4.7 0.9     0.34

 lower alpha upper     95% confidence boundaries
0.69 0.7 0.72 

 Reliability if an item is dropped:
    raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
A1-      0.72      0.73    0.67      0.40 2.6   0.0087 0.0065  0.38
A2       0.62      0.63    0.58      0.29 1.7   0.0119 0.0169  0.29
A3       0.60      0.61    0.56      0.28 1.6   0.0124 0.0094  0.32
A4       0.69      0.69    0.65      0.36 2.3   0.0098 0.0159  0.37
A5       0.64      0.66    0.61      0.32 1.9   0.0111 0.0126  0.34

 Item statistics 
       n raw.r std.r r.cor r.drop mean  sd
A1- 2784  0.58  0.57  0.38   0.31  4.6 1.4
A2  2773  0.73  0.75  0.67   0.56  4.8 1.2
A3  2774  0.76  0.77  0.71   0.59  4.6 1.3
A4  2781  0.65  0.63  0.47   0.39  4.7 1.5
A5  2784  0.69  0.70  0.60   0.49  4.6 1.3

Non missing response frequency for each item
      1    2    3    4    5    6 miss
A1 0.33 0.29 0.14 0.12 0.08 0.03 0.01
A2 0.02 0.05 0.05 0.20 0.37 0.31 0.01
A3 0.03 0.06 0.07 0.20 0.36 0.27 0.01
A4 0.05 0.08 0.07 0.16 0.24 0.41 0.01
A5 0.02 0.07 0.09 0.22 0.35 0.25 0.01

We get a fair chunk of output, but let’s just focus on the “std.alpha” value of .71. By most “rules-of-thumb”, anything at or above .7 is acceptable for general-purpose use. If we wanted to make very important decisions with our measure, we would look for something approaching or exceeding .9.

You likely noticed the “check.keys = TRUE” argument – alpha is bound by test length and item correlations. In a case where some items are negatively correlated with the rest, our \(\alpha\) coefficient will be greatly reduced. Let’s try it for ourselves:

psych::alpha(agreeDat)
Some items ( A1 ) were negatively correlated with the total scale and 
probably should be reversed.  
To do this, run the function again with the 'check.keys=TRUE' option

Reliability analysis   
Call: psych::alpha(x = agreeDat)

  raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
      0.43      0.46    0.53      0.15 0.85 0.016  4.2 0.74     0.32

 lower alpha upper     95% confidence boundaries
0.4 0.43 0.46 

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r  S/N alpha se  var.r med.r
A1      0.72      0.73    0.67     0.398 2.64   0.0087 0.0065 0.376
A2      0.28      0.30    0.39     0.097 0.43   0.0219 0.1098 0.081
A3      0.18      0.21    0.31     0.061 0.26   0.0249 0.1015 0.081
A4      0.25      0.31    0.44     0.099 0.44   0.0229 0.1607 0.105
A5      0.21      0.24    0.36     0.072 0.31   0.0238 0.1311 0.095

 Item statistics 
      n raw.r std.r r.cor r.drop mean  sd
A1 2784 0.066 0.024 -0.39  -0.31  2.4 1.4
A2 2773 0.630 0.666  0.58   0.37  4.8 1.2
A3 2774 0.724 0.742  0.72   0.48  4.6 1.3
A4 2781 0.686 0.661  0.50   0.37  4.7 1.5
A5 2784 0.700 0.719  0.64   0.45  4.6 1.3

Non missing response frequency for each item
      1    2    3    4    5    6 miss
A1 0.33 0.29 0.14 0.12 0.08 0.03 0.01
A2 0.02 0.05 0.05 0.20 0.37 0.31 0.01
A3 0.03 0.06 0.07 0.20 0.36 0.27 0.01
A4 0.05 0.08 0.07 0.16 0.24 0.41 0.01
A5 0.02 0.07 0.09 0.22 0.35 0.25 0.01

Very different, right?

Let’s calculate it on our own. We will need to recode our negatively correlated variable (A1) first.

agreeDatRecode = agreeDat %>% 
  mutate(A1 = abs(A1-7)) # This recodes our scale direction

nvar = length(agreeDatRecode)

# This computes the covariance matrix of our data.

C = cov(agreeDatRecode, use = "complete.obs")

n = dim(C)[2]

total = rowMeans(agreeDatRecode, na.rm = TRUE)

# The tr function is just adding everything on the 
# diagonal from our covariance matrix.

alpha.raw = (1 - psych::tr(C)/sum(C)) * (n/(n - 1))

How did we do? We perfectly replicated our raw alpha value from the output. This is an easy example, but more elaborate ones don’t take much more effort.

We did all of this to say that \(\alpha\) helps us to know how consistent a one factor measure is within itself. In many circumstances where you find yourself with a unidimensional factor structure, \(\alpha\) will be more than sufficient for convincing people that your scale is reliable (weaknesses aside).

There are, however, many other forms of reliability (\(\alpha\) is a generalization of an earlier statistic, the Kuder-Richardson 20). One popular and conceptually helpful form of reliability is McDonald’s \(\omega\). McDonald’s \(\omega\) is well-suited for hierarchical factor structures, in which various subscales assess something bigger.

Earlier, we played with the bfi data. Let’s return to that, but with a hypothesis that the 5 factors of the bfi are actually subfactors of some larger latent personality variable.

bfiSubFactors = bfi %>% 
  select(-gender, -education, -age)

omegaTest = psych::omega(bfiSubFactors, nfactors = 5, 
             rotate = "promax", plot = FALSE)

omega.diagram(omegaTest, sl = FALSE)

This is what our factor structure would look like in a hierarchical fashion. We have the general factor (g) and our 5 subfactors (also called grouping factors). The provided values show how strongly our subfactors load onto the general factor, and how strongly each individual item loads onto a subfactor.

We can also look at some reliability values (and a cleaner loading matrix):

omegaTest
Omega 
Call: psych::omega(m = bfiSubFactors, nfactors = 5, plot = FALSE, rotate = "promax")
Alpha:                 0.82 
G.6:                   0.86 
Omega Hierarchical:    0.52 
Omega H asymptotic:    0.6 
Omega Total            0.87 

Schmid Leiman Factor loadings greater than  0.2 
        g   F1*   F2*   F3*   F4*   F5*   h2   u2   p2
A1-                          0.40       0.19 0.81 0.09
A2   0.37                    0.54       0.45 0.55 0.30
A3   0.43                    0.55       0.52 0.48 0.35
A4   0.32                    0.36       0.28 0.72 0.37
A5   0.45        0.23        0.44       0.46 0.54 0.45
C1   0.30              0.45             0.33 0.67 0.28
C2   0.33              0.56             0.45 0.55 0.24
C3   0.29              0.48             0.32 0.68 0.26
C4-  0.37              0.50             0.45 0.55 0.31
C5-  0.40 -0.21        0.45             0.43 0.57 0.38
E1-  0.36        0.44                   0.35 0.65 0.38
E2-  0.51        0.50                   0.54 0.46 0.48
E3   0.46        0.38              0.21 0.44 0.56 0.49
E4   0.51        0.46                   0.53 0.47 0.48
E5   0.48        0.34                   0.40 0.60 0.57
N1-  0.20 -0.78              0.22       0.65 0.35 0.06
N2-  0.20 -0.75              0.20       0.60 0.40 0.06
N3-  0.21 -0.71                         0.55 0.45 0.08
N4-  0.37 -0.51  0.21                   0.49 0.51 0.27
N5-  0.20 -0.51                         0.35 0.65 0.11
O1   0.28                          0.47 0.31 0.69 0.25
O2-                                0.45 0.26 0.74 0.06
O3   0.34        0.20              0.55 0.46 0.54 0.25
O4-                               -0.36 0.25 0.75 0.01
O5-                                0.53 0.30 0.70 0.07

With eigenvalues of:
  g F1* F2* F3* F4* F5* 
2.8 2.5 1.2 1.3 1.3 1.3 

general/max  1.15   max/min =   2.1
mean percent general =  0.27    with sd =  0.16 and cv of  0.61 
Explained Common Variance of the general factor =  0.27 


Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 275  and the fit is  4.42 
The number of observations was  2800  with Chi Square =  12321.03  with prob <  0
The root mean square of the residuals is  0.13 
The df corrected root mean square of the residuals is  0.14 

RMSEA index =  0.125  and the 10 % confidence intervals are  0.123 0.127
BIC =  10138.25 

Measures of factor score adequacy             
                                                 g  F1*   F2*  F3*  F4*
Correlation of scores with factors            0.76 0.92  0.69 0.76 0.80
Multiple R square of scores with factors      0.58 0.85  0.47 0.59 0.63
Minimum correlation of factor score estimates 0.17 0.70 -0.06 0.17 0.27
                                               F5*
Correlation of scores with factors            0.81
Multiple R square of scores with factors      0.66
Minimum correlation of factor score estimates 0.32

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*  F4*
Omega total for total scores and subscales    0.87 0.84 0.77 0.73 0.70
Omega general for total scores and subscales  0.52 0.10 0.42 0.24 0.25
Omega group for total scores and subscales    0.25 0.75 0.35 0.50 0.45
                                               F5*
Omega total for total scores and subscales    0.49
Omega general for total scores and subscales  0.12
Omega group for total scores and subscales    0.37

While we see that we can get an \(\alpha\) value, we also have Guttman’s \(\lambda_6\) and 3 different \(\omega\) values that are of interest to us. Since we are dealing with something of a hierarchical nature, we already know that \(\alpha\) and \(\lambda_6\) won’t be appropriate. \(\omega_h\) gives us an idea about the reliability of the general factor (perhaps not great in this case) and \(\omega_t\) gives us the total reliability of the test (very good).

We can also see item loadings for the general factor and each subfactor. Generally, we find that the items we would anticipate loading together do load together (e.g., all N items load together, and all O items load together), along with some cross-loadings from other subfactors.

Item Response Theory

Item Response Theory And Factor Analysis

Last week, we learned about latent variables and how we can use factor analysis to construct measures for assessing latent traits. While we are still dealing in the world of latent variables, we are taking a new approach to measuring those latent variables – Item Response Theory (IRT). While both allow us to measure a person’s level of a latent trait, IRT gives us the power to understand the person and the items. We can recall that factor analysis falls under the broader heading of Classical Test Theory (CTT) – the name is a clear indication that the focus is on the test as a whole, not individual items.

In addition to this broadened understanding, IRT has a different focus from a measurement perspective than factor analysis. Perhaps most important is the notion of measurement error. In factor analysis, measurement error is assumed to be the same for every item and every person; this is not the case in IRT, where precision can differ across items and across ability levels.

Dichotomous And Polytomous Models

Before we dive into some of the more important parts of IRT, it is important to draw some distinctions between two broad families of models: dichotomous and polytomous. A dichotomous model can best be thought of in testing situations where a correct answer exists (you either get the question correct or not, so the outcome is dichotomous). With that knowledge, you might have already guessed what the polytomous model entails – items that have ordered responses with no clear correct choices (Likert response options would be a good example). Polytomous models include such models as the graded response model and the partial credit model.

We will see how these models behave differently as we continue discussing the different parts of IRT models.

Ability

In IRT, we talk about a person’s ability; essentially, it is the level of the latent variable that a person possesses. In the parlance of psychometrics, ability is denoted as \(\theta\). The \(\theta\) scale is standardized with a mean of 0 and typically runs from -4 to 4 (though it is not strictly bounded to that range).

Different Parameters

The ability to explore individual items within IRT is based upon three different parameters: location, discrimination, and pseudoguessing. We can incorporate any of those parameters into our model. First, though, we need to understand what those parameters mean.

Location

In IRT, the location parameter (denoted \(b_i\)) describes the difficulty of the item. If we are talking about a testing situation (e.g., a math or science test), the meaning of \(b_i\) is nearly self-evident – in a dichotomous model, it is the ability level at which a person has a 50% chance of responding correctly; in polytomous models, it marks where a person becomes likely to move from one response category to the next (the point of passing from one category to the next is called a threshold).

Let’s look at some examples from the mirt documentation of different item locations. The ltm package is also a great package for IRT models.

library(mirt)

library(ltm)

# The model argument set to 1 means we are just looking
# at a one-factor model.

fit1PL = mirt(LSAT, model = 1, itemtype = "Rasch", verbose = FALSE)

plot(fit1PL, type = 'trace', which.items = 1:5, 
     facet_items = FALSE)

The resulting plot is what is known as an Item Characteristic Curve (ICC). It contains a lot of information, but we are only going to pay attention to the location right now. Without any prior knowledge, which of the labelled curves would you say represents the most difficult item? If you said “3” (the black line), then you are absolutely correct. When we look at the location of an item in an ICC graph, we are looking for its placement along the x-axis at its middle point (i.e., find the middle of a curve and draw a line straight down – the farther to the right, the more difficult the item). With item 3, we can see that it takes someone with slightly more than average ability to have a better-than-chance probability of responding correctly.
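The curve being plotted has a simple closed form. Here is a minimal base R sketch (the function name is mine): under the 1PL, the probability of a correct response is a logistic function of the gap between ability and difficulty.

```r
# 1PL item characteristic curve: P(correct) = logistic(theta - b)
icc1PL = function(theta, b) plogis(theta - b)

icc1PL(theta = 0, b = 0) # average person, average item: 0.5
icc1PL(theta = 0, b = 1) # same person, harder item: below 0.5
icc1PL(theta = 2, b = 1) # more able person, same item: above 0.5
```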

Discrimination

Item discrimination, the \(a_i\) parameter, describes how sharply an item separates people of differing ability – in other words, how quickly the probability of a correct response changes as ability increases.

Let’s specify a slightly different model and take a look at our ICC again:

fit2PL = mirt(LSAT, model = 1, itemtype = "2PL", verbose = FALSE)

plot(fit2PL, type = 'trace', which.items = 1:5, 
     facet_items = FALSE)

In our last model, we saw that all of the items had the exact same curve shape – this is no longer the case. If we pay attention to item 3 again, we see that a person with very low ability (\(\theta = -4\)) has nearly a probability of 0 of responding correctly, someone with average ability (\(\theta = 0\)) has around a 50-50 chance of responding correctly, and high-ability people (\(\theta = 4\)) have a nearly 100% chance of responding correctly. We would definitely say that item 3 discriminates between different ability levels pretty well (we will see if it is the most discriminating item later). Which item might not discriminate too well? Without doubt, you said item 1, and you would be correct. While people at lower ability levels have less chance of responding correctly to item 1, we can see that a probability of 1 is obtained pretty soon after crossing \(\theta = 0\). So while item 1 may provide a bit of discrimination at the lower end of \(\theta\), it does nothing once we are dealing with average ability or above. This provides even more evidence as to why item 3 is such a good item – it discriminates well along every point of \(\theta\).

So, we have a pretty good idea that item 3 is the most difficult item and the most discriminating item. This, hopefully, provides a demonstration on why IRT models are so powerful – we can compare individual items to people (on the same scales, no less).

Guessing

The third parameter, \(c_i\), is the pseudoguessing parameter. Even the blind squirrel finds a nut, and \(c_i\) accounts for this. Essentially, the guessing parameter sets a lower bound on the probability of a correct response, accounting for the chance of guessing the correct answer.

Models And Parameters

1PL

Now that we know what our parameters mean, we can start putting them into models. The one parameter logistic model (1PL) only estimates item difficulty (\(b_i\)). It assumes that all items discriminate equally and that all items have the same guessing parameter.

Although there are some minor differences, you will often see the 1PL and Rasch model used interchangeably.

2PL

The two parameter logistic model (2PL) adds the discrimination parameter, \(a_i\), into the model.

3PL

The three parameter logistic model (3PL) adds in the guessing parameter, \(c_i\). You might wonder how in the world a guessing parameter is added! If someone could guess the correct answer (think about a multiple choice test), would the probability of a correct answer ever be 0? I would think not. So, \(c_i\) just adds a lower-bound asymptote.

fit3PL = mirt(LSAT, model = 1, itemtype = "3PL", verbose = FALSE)

plot(fit3PL, type = 'trace', which.items = 1:5, 
     facet_items = FALSE)

If we compare these curves to the ICC plots we saw for the 1PL and 2PL earlier, we see a clear difference in the lower asymptote.

par(mfrow = c(1, 3))

plot(fit1PL, type = 'trace', which.items = 1:5, 
     facet_items = FALSE, main = "1PL")

plot(fit2PL, type = 'trace', which.items = 1:5, 
     facet_items = FALSE, main = "2PL")

plot(fit3PL, type = 'trace', which.items = 1:5, 
     facet_items = FALSE, main = "3PL")

While we see that the probability for a low ability individual obtaining a correct answer for item 3 is still very low, it has improved slightly.

With all of these parameters, the probability of a correct response can be expressed as follows: \[p_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)} }\]
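To make the formula concrete, here is a quick sketch of it in R (p3PL is our own helper function, not something from mirt):

```r
# Probability of a correct response under the 3PL model
p3PL = function(theta, a, b, c) {
  c + (1 - c) / (1 + exp(-a * (theta - b)))
}

# With no guessing (c = 0), a person whose ability equals the item
# difficulty has a 50-50 chance of responding correctly:
p3PL(theta = 0, a = 1, b = 0, c = 0)  # 0.5

# A guessing parameter raises the lower asymptote, so even very low
# ability never pushes the probability below c:
p3PL(theta = -4, a = 1.5, b = 0, c = .2)
```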

Scoring

Scoring in IRT models works a bit differently than scoring in CTT.

An element common to both IRT and CTT is the true score, \(observed_{score} = true_{score} + e\). The distinction rests in how they treat \(e\) – CTT assumes that error is the same for everyone, while IRT allows error to vary across people.

IRT scores also have the added benefit of using the item response functions in the computation of scores.

We can see the factor scores returned from our model:

facScores = fscores(fit3PL)

head(facScores)
            F1
[1,] -1.847728
[2,] -1.847728
[3,] -1.847728
[4,] -1.444931
[5,] -1.444931
[6,] -1.444931
summary(facScores)
       F1            
 Min.   :-1.8477281  
 1st Qu.:-0.4680632  
 Median : 0.0099887  
 Mean   : 0.0000019  
 3rd Qu.: 0.6527025  
 Max.   : 0.6527025  

And let’s see what scores from a factor analysis would look like compared to these scores:

faScores = psych::fa(LSAT, 1, rotate = "promax")$scores

summary(faScores)
      MR1          
 Min.   :-2.02900  
 1st Qu.:-0.44756  
 Median : 0.04501  
 Mean   : 0.00000  
 3rd Qu.: 0.63895  
 Max.   : 0.63895  
cor(faScores, facScores)
           F1
MR1 0.9987849
plot(faScores, facScores)

We can see some very clear similarities and an undeniable correlation between our two different factor scores. While certainly similar, we can be pretty confident that our IRT scores are closer to true scores.

We can also put those scores back into our data to see how they correlate with the other items:

library(dplyr)

LSAT %>% 
  mutate(scores = facScores) %>% 
  cor() %>% 
  GGally::ggcorr()

We can see that overall scores are most correlated with item 3, followed by item 2, and item 4. If we recall our ICC plot from above, this probably should not surprise us too much – item 3 is definitely the best item, while items 1 and 5 don’t look as nice.

Different Measurement Levels

So far, we have discussed models that deal with dichotomous variables (incorrect and correct responses). These are aptly named dichotomous models. However, we know that not all measures have a binary response. If we have multiple response options (perhaps a Likert response option format), then we need to use a polytomous model. One common model is the Graded Response Model (GRM).

The GRM is designed expressly for ordinal, polytomous response options.

Let’s take a look:

library(psych)

library(dplyr)

testData = bfi %>% 
  dplyr::select(starts_with("C")) %>% 
  na.omit()

library(ggridges)
library(viridis)
library(ggplot2)

testData %>% 
  tidyr::gather(key = question, value = score) %>% 
  ggplot(., aes(score, question, fill = ..x..)) +
  geom_density_ridges_gradient() +
  scale_fill_viridis(option = "B") +
  theme_minimal()

grmMod = mirt(testData, model = 1, itemtype = "graded", verbose = FALSE)

coef(grmMod, simplify = TRUE, IRTpars = TRUE)
$items
        a     b1     b2     b3     b4     b5
C1  1.443 -3.157 -2.185 -1.419 -0.347  1.200
C2  1.605 -2.771 -1.725 -1.090 -0.154  1.254
C3  1.314 -3.189 -1.935 -1.232 -0.033  1.573
C4 -1.877  0.789 -0.251 -0.877 -1.721 -2.773
C5 -1.386  1.449  0.443 -0.051 -0.969 -2.034

$means
F1 
 0 

$cov
   F1
F1  1

In our coefficient output, we see coefficients for each item (b1:b5). These coefficients detail the thresholds for moving from one response level to the next (we also see the discrimination value).

Just like our dichotomous models, we can also see our item characteristic curves.

plot(grmMod, type = 'trace', which.items = 1:5, 
     facet_items = TRUE, main = "GRM")

We can see that these are very different looking ICC plots than what we saw before. Remember that we are dealing with a different model here. Instead of the probability of an incorrect or correct response, we are now dealing with the probability of endorsing each response category. To that end, each item’s ICC has a curve for each response option over ability. If we look at the ICC for C1, we can see that someone offering a response of 1 on the scale would have a higher probability of having a lower ability. Conversely, someone offering a response of 6 would have a higher probability of being high ability.

Mixed Models

Packages

We will need the following:

install.packages(c("lme4", "lmerTest", "merTools"))

Terminology

While the specifics of each model you have learned to this point might take some time to get our heads all the way around, the terminology has largely been pretty clear – no more. You will hear “mixed models”, “mixed effects models”, “hierarchical linear models”, “nested models”, and/or “multilevel models”; these are all slight variations on a common theme. For the sake of our work here, we will keep it at mixed models. Within our mixed model, we have an additional source of cloudiness: fixed and random effects. The random effects don’t pose much of an issue (we will define them later), but fixed effects have four different definitions depending upon whom you ask. For the sake of simplicity (again), we are going to consider fixed effects as an effect on the individual unit of analysis. This will all start to make sense once we take a look at the models.

Standard Linear Model

For the sake of conceptual grounding, let’s go back to our standard linear model:

library(dplyr)

library(ggplot2)

healthData = readr::read_csv("https://www.nd.edu/~sberry5/data/healthViolationsDistances.csv")

healthData = healthData %>% 
  mutate(BORO = as.factor(.$BORO), 
         cuisine = as.factor(.$`CUISINE DESCRIPTION`), 
         distanceCentered = dohDistanceMeter - mean(dohDistanceMeter))

lmTest = lm(SCORE ~ distanceCentered, data = healthData)

ggplot(healthData, aes(SCORE, distanceCentered)) +
  geom_point() + 
  geom_smooth(method = "lm")

summary(lmTest)

Call:
lm(formula = SCORE ~ distanceCentered, data = healthData)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.249  -9.312  -3.146   5.755  95.787 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       2.104e+01  6.306e-02 333.615  < 2e-16 ***
distanceCentered -4.905e-05  1.354e-05  -3.622 0.000293 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.8 on 47907 degrees of freedom
Multiple R-squared:  0.0002737, Adjusted R-squared:  0.0002529 
F-statistic: 13.12 on 1 and 47907 DF,  p-value: 0.000293

We have our standard output here. As before, our intercept is the average score when there is zero distance between the restaurant and the Department of Health, and our coefficient for distance is telling us that for every meter increase in distance, the score decreases by some tiny amount. We know that we are ignoring some information within our model, namely the clustering that occurs based upon cuisine and/or borough.

Let’s take a quick look at how each of the different boroughs might behave in a model:

ggplot(healthData, aes(SCORE, distanceCentered, group = BORO)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap( ~ BORO)

We can see that we have very different effects for each level of BORO.

Why Mixed Models

Given what we saw in the preceding figure, mixed models will allow us to use the information for each level of BORO in the model.

Mixed models don’t get bogged down by large groups, and smaller groups do not get buried by the larger groups (in linear models, you likely learned about balance in t-tests; mixed models attenuate the effects of group imbalance). Mixed models also will not overfit or underfit when we have repeated samples/measurements.

Random Intercepts Model

Let’s include borough in our model. We are not going to add it as another predictor, but we are going to include it as another level to our model. The lme4 package will make this very easy:

library(lme4)

riMod = lmer(SCORE ~ distanceCentered + (1|BORO), data = healthData)

Before we look at the summary for this model, let’s get an idea about what is happening in the syntax. The first part of our formula should look familiar – these are the global estimates (fixed effects) within our model and will behave exactly the same as our standard linear model.

The next part, in the parentheses, is how we denote our random effect. Whenever you see a 1 included in a formula interface, we can be pretty comfortable that it is in reference to an intercept. The | specifies a grouping. With that information, we might be able to guess that we are specifying a random intercept for each borough.
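In equation form (using one standard mixed model notation, with \(i\) indexing restaurants and \(j\) indexing boroughs), the random intercept model is: \[SCORE_{ij} = (\beta_0 + u_{0j}) + \beta_1 distanceCentered_{ij} + e_{ij}, \quad u_{0j} \sim N(0, \sigma^2_{u_0})\] Each borough gets its own intercept shift \(u_{0j}\), while the slope for distance is shared globally.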

We should probably check out the summary:

summary(riMod)
Linear mixed model fit by REML ['lmerMod']
Formula: SCORE ~ distanceCentered + (1 | BORO)
   Data: healthData

REML criterion at convergence: 387368.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.6457 -0.7010 -0.2296  0.4065  6.9165 

Random effects:
 Groups   Name        Variance Std.Dev.
 BORO     (Intercept)   0.9879  0.9939 
 Residual             189.9733 13.7831 
Number of obs: 47909, groups:  BORO, 5

Fixed effects:
                  Estimate Std. Error t value
(Intercept)      2.076e+01  4.592e-01  45.204
distanceCentered 1.078e-05  1.761e-05   0.612

Correlation of Fixed Effects:
            (Intr)
distncCntrd -0.111
fit warnings:
Some predictor variables are on very different scales: consider rescaling

We have our standard output and we can see that our coefficients have changed from our standard linear model – this change is largely due to the severe imbalance between our groups.

We can see the imbalance between the groups by using group_by and summarize:

healthData %>% 
  group_by(BORO) %>% 
  summarize(n())
# A tibble: 5 x 2
  BORO          `n()`
  <fct>         <int>
1 BRONX          4083
2 BROOKLYN      11079
3 MANHATTAN     20250
4 QUEENS        11558
5 STATEN ISLAND   939

We can clearly see that we have large disparities between our groups.

When groups are largely balanced, we would find that our coefficients would be the same (or very close to it).

What should almost always change is our standard errors – by integrating information about the groups, we are getting a better sense of how much uncertainty our model contains at the global average level.

We also see some additional information – this is for our random effects. The standard deviation is telling us how much the score moves around based upon borough after getting the information from our fixed effects (i.e., the health score can move around nearly 1 whole point from borough alone). We can compare the standard deviation for BORO to the coefficient for distanceCentered – borough is contributing more variability to scores than distance. We can also divide the random effect variance by the sum of all of the variance components to get the proportion of variance it accounts for.

.9879 / (.9879 + 189.9733)
[1] 0.005173302

So while it might be doing more than what distance does, borough is not accounting for too much variance.

We can also plot simulated random effect ranges for each of the random effect groups. We want to pay attention to those that are highlighted in black (i.e., the range does not cross the red line at 0).

library(merTools)

plotREsim(REsim(riMod))

levels(healthData$BORO)
[1] "BRONX"         "BROOKLYN"      "MANHATTAN"     "QUEENS"       
[5] "STATEN ISLAND"

In examining the plot, we see that the random effect ranges for the Bronx and Staten Island have significant effects on health score.

Going back to the output, did you notice anything missing? P-values! Estimating p-values in a mixed model is exceedingly difficult because of varying group sizes, the complete sample n, and how those relate to reference distributions. If you need something that will help, you can get confidence intervals in the same way that you would for anything else:

confint(riMod)
                         2.5 %       97.5 %
.sig01            5.025929e-01 1.961607e+00
.sigma            1.369612e+01 1.387068e+01
(Intercept)       1.978061e+01 2.174178e+01
distanceCentered -2.375822e-05 4.505094e-05

If you really want to see p-values, you can get them easily:

riModP = lmerTest::lmer(SCORE ~ distanceCentered + (1|BORO), data = healthData)

summary(riModP)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: SCORE ~ distanceCentered + (1 | BORO)
   Data: healthData

REML criterion at convergence: 387368.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.6457 -0.7010 -0.2296  0.4065  6.9165 

Random effects:
 Groups   Name        Variance Std.Dev.
 BORO     (Intercept)   0.9879  0.9939 
 Residual             189.9733 13.7831 
Number of obs: 47909, groups:  BORO, 5

Fixed effects:
                  Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)      2.076e+01  4.592e-01 4.133e+00  45.204 9.96e-07 ***
distanceCentered 1.078e-05  1.761e-05 3.455e+03   0.612     0.54    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)
distncCntrd -0.111
fit warnings:
Some predictor variables are on very different scales: consider rescaling

NOTE: I would never load the lmerTest package, but would attach with colons! If you load it, you will find that it masks things from lme4 that you don’t want to have masked (i.e., lmer) and they are not equivalent objects!

Getting predicted values from our mixed model is no different than getting them from any other model:

mixedPred = predict(riMod)

slimPred = predict(lmTest)

head(cbind(actual = healthData$SCORE, 
      mixed = mixedPred, 
      slim = slimPred), 20)
   actual    mixed     slim
1      18 21.40671 20.67657
2      39 21.33549 21.00055
3      12 21.38798 20.28845
4      28 21.34376 20.48960
5      10 21.28248 21.24163
6       4 21.32554 21.04579
7       9 19.45224 20.67354
8      18 21.68792 21.12855
9      25 21.65763 21.26634
10     19 20.17974 20.97907
11      7 20.21237 20.83064
12     17 20.10837 21.30368
13     10 20.12837 21.21274
14      9 20.12701 21.21891
15     38 20.19149 20.92563
16     19 21.68716 21.13201
17     11 21.66144 21.24901
18     11 21.70010 21.07315
19     18 21.66504 21.23266
20     17 21.33218 21.01559

While there were cases where the standard linear model did a slightly better job, our mixed model generally did a better job (even if marginally so).

Let’s add some more information to our model. As we dive into our data, we will notice that we also have cuisine groupings. We can add this additional grouping into our model:

clusterMod = lmer(SCORE ~ distanceCentered + (1|cuisine) + (1|BORO), data = healthData)

This is often called a cross-classified model.

summary(clusterMod)
Linear mixed model fit by REML ['lmerMod']
Formula: SCORE ~ distanceCentered + (1 | cuisine) + (1 | BORO)
   Data: healthData

REML criterion at convergence: 386632.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8661 -0.6856 -0.2399  0.4194  6.8509 

Random effects:
 Groups   Name        Variance Std.Dev.
 cuisine  (Intercept)   2.901   1.703  
 BORO     (Intercept)   1.867   1.366  
 Residual             186.933  13.672  
Number of obs: 47909, groups:  cuisine, 12; BORO, 5

Fixed effects:
                   Estimate Std. Error t value
(Intercept)       2.132e+01  8.116e-01  26.269
distanceCentered -2.450e-05  1.791e-05  -1.368

Correlation of Fixed Effects:
            (Intr)
distncCntrd -0.067
fit warnings:
Some predictor variables are on very different scales: consider rescaling

Let’s look at our variances how we did earlier:

# cuisine

2.90 / (2.90 + 1.867 + 186.933)
[1] 0.0151278
# BORO

1.867 / (1.867 + 2.90 + 186.933)
[1] 0.009739176
plotREsim(REsim(clusterMod))

We should also see if our predictions improved:

clusterPredict = predict(clusterMod)

head(cbind(actual = healthData$SCORE, 
           clustered = clusterPredict,
           mixed = mixedPred, 
           slim = slimPred), 20)
   actual clustered    mixed     slim
1      18  19.28713 21.40671 20.67657
2      39  19.44896 21.33549 21.00055
3      12  20.07550 21.38798 20.28845
4      28  20.17598 21.34376 20.48960
5      10  23.67651 21.28248 21.24163
6       4  19.47156 21.32554 21.04579
7       9  17.25207 19.45224 20.67354
8      18  24.36511 21.68792 21.12855
9      25  20.32682 21.65763 21.26634
10     19  22.06295 20.17974 20.97907
11      7  17.88168 20.21237 20.83064
12     17  21.14971 20.10837 21.30368
13     10  21.93627 20.12837 21.21274
14      9  22.18274 20.12701 21.21891
15     38  22.03625 20.19149 20.92563
16     19  24.36684 21.68716 21.13201
17     11  20.31816 21.66144 21.24901
18     11  25.39586 21.70010 21.07315
19     18  20.30999 21.66504 21.23266
20     17  19.45647 21.33218 21.01559

For many observations, our predictions definitely got tighter (many are still far off, though).

If we continue to look at our data (and with some knowledge about how NYC does health inspections), we will see that restaurants are rated yearly – let’s use this information in our model. We won’t worry about distance anymore, because now we have a few competing hypotheses. We could imagine two different ways that this works: one in which a restaurant’s score improves as observations increase (it takes some time for the owner to get the staff fully up to speed) or one in which the score decreases as the observations increase (the “shine has worn off the penny”).

Let’s do a bit of data processing first.

healthDataGrouped = healthData %>% 
  tidyr::unite(col = nameLocation, DBA, BUILDING , remove = FALSE) %>% 
  group_by(nameLocation) %>%
  arrange(lubridate::mdy(`GRADE DATE`)) %>% 
  mutate(observation = 1:n())

timeReviewed = healthDataGrouped %>% 
  summarize(n = n()) %>% 
  filter(n > 10)

reviewedRest = healthDataGrouped[which(healthDataGrouped$nameLocation %in% 
                                         timeReviewed$nameLocation), ]
observationMod = lmer(SCORE ~ observation + (1|nameLocation), data = reviewedRest)

In this model, we have a fixed effect for observation and we are allowing each location to have its own random intercept. We have essentially created a model that will deal with the repeated measures for each of the locations.

reviewedRest %>% 
  arrange(nameLocation, observation) %>% 
  head(., 350) %>% 
  ggplot(., aes(observation, SCORE, group = nameLocation)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap( ~ nameLocation) +
  theme_minimal()

summary(observationMod)
Linear mixed model fit by REML ['lmerMod']
Formula: SCORE ~ observation + (1 | nameLocation)
   Data: reviewedRest

REML criterion at convergence: 345661

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.0067 -0.5765 -0.1311  0.4606  6.6361 

Random effects:
 Groups       Name        Variance Std.Dev.
 nameLocation (Intercept)  29.46    5.428  
 Residual                 125.82   11.217  
Number of obs: 44622, groups:  nameLocation, 1750

Fixed effects:
             Estimate Std. Error t value
(Intercept) 13.948018   0.159359   87.53
observation  0.461932   0.005331   86.65

Correlation of Fixed Effects:
            (Intr)
observation -0.455

Our fixed effect here would indicate that we have a slight increase in scores as our observations increase, but we can also see that scores will bounce around about 5 points on average by location alone.

29.46 / (29.46 + 125.82)
[1] 0.1897218

We see that the location alone accounts for nearly 20% of the variance in health scores.

Hierarchical Models

Hierarchical models are a slight variation on the models that we have just seen. In these models, we have groups nested within other groups. We know that we have a “BORO” variable, but a quick look at our data will show that we also have a DBA variable; this variable gives the name of the restaurant. One source of inquiry would be to inspect how various chain restaurants, nested within each borough, perform.

Let’s check out how some chain restaurants do within the boroughs.

chainFood = healthDataGrouped %>% 
  filter(DBA == "BURGER KING" |
           DBA == "MCDONALD'S" | 
           DBA == "PIZZA HUT" |
           DBA == "SUBWAY")

The model set-up is just different enough to cause some potential confusion here. Within the parentheses, we have our intercept as before, but we are also saying that we are looking at the DBA groups within the BORO groups.

hierMod = lme4::lmer(SCORE ~ observation + (1|BORO/DBA), 
                     data = chainFood)

summary(hierMod)
Linear mixed model fit by REML ['lmerMod']
Formula: SCORE ~ observation + (1 | BORO/DBA)
   Data: chainFood

REML criterion at convergence: 1317.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.9087 -0.5211 -0.1990  0.1679  3.7629 

Random effects:
 Groups   Name        Variance Std.Dev.
 DBA:BORO (Intercept)  11.076   3.328  
 BORO     (Intercept)   1.877   1.370  
 Residual             168.917  12.997  
Number of obs: 165, groups:  DBA:BORO, 9; BORO, 5

Fixed effects:
            Estimate Std. Error t value
(Intercept)   7.5428     2.2741   3.317
observation   1.2320     0.2306   5.342

Correlation of Fixed Effects:
            (Intr)
observation -0.609

We have our fixed effect for observation (indicating that there is a positive relationship between observations and scores) and we have individual intercepts for both BORO and DBA nested within BORO. Looking at the standard deviation of our random effects, we can see that the scores bounce around a little over 3 points from chain to chain. If we explore the variance components, we see that around 6% (11.076 / (11.076 + 1.877 + 168.91)) of the variance in score is handled within our nested random effect.
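The inline arithmetic can be spelled out using the variance components copied from the summary above:

```r
# Variance components from the random effects table
vcNested = 11.076   # DBA nested within BORO
vcBoro = 1.877      # BORO
vcResid = 168.917   # residual

# Proportion of score variance from the nested DBA:BORO effect
vcNested / (vcNested + vcBoro + vcResid)  # ~0.06
```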

We can also look at our effect ranges:

plotREsim(REsim(hierMod))

Even though it does not look like there is too much going on in the way of effects here, we still need to know the group(s) that each dot represent(s):

unique(hierMod@flist$BORO)
[1] QUEENS        BROOKLYN      MANHATTAN     STATEN ISLAND BRONX        
Levels: BRONX BROOKLYN MANHATTAN QUEENS STATEN ISLAND
unique(hierMod@flist$`DBA:BORO`)
[1] MCDONALD'S:QUEENS        SUBWAY:BROOKLYN         
[3] MCDONALD'S:MANHATTAN     PIZZA HUT:MANHATTAN     
[5] MCDONALD'S:BROOKLYN      PIZZA HUT:QUEENS        
[7] MCDONALD'S:STATEN ISLAND BURGER KING:BROOKLYN    
[9] MCDONALD'S:BRONX        
9 Levels: BURGER KING:BROOKLYN MCDONALD'S:BRONX ... SUBWAY:BROOKLYN

They appear in the same order in the graph as they do printed out, so we can just map them on. While the interval is large and certainly crosses 0, it seems like the effect for McDonald’s in the Bronx is sizeable.

Random Slopes

Now that we have seen random intercepts and hierarchical models, we can add one final piece: random slopes. In the following model, we will specify a random intercept (recall, it is the 1 within our parentheses) and a random slope (we prefix our grouping variable with the slope of interest). Not only will this model allow the intercept to vary between groups, but it will also allow the slope to vary between groups.
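Written out (again in one common notation, with \(j\) indexing locations), the random slope model becomes: \[SCORE_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})observation_{ij} + e_{ij}\] where the pair \((u_{0j}, u_{1j})\) is assumed to be jointly normal – that joint distribution is why the random effects output also reports a correlation between the random intercepts and slopes.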

observationMod = lmer(SCORE ~ observation + (1 + observation|nameLocation), data = reviewedRest)

summary(observationMod)
Linear mixed model fit by REML ['lmerMod']
Formula: SCORE ~ observation + (1 + observation | nameLocation)
   Data: reviewedRest

REML criterion at convergence: 342130.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.3899 -0.5413 -0.0949  0.4448  7.6038 

Random effects:
 Groups       Name        Variance Std.Dev. Corr 
 nameLocation (Intercept)  40.3173  6.3496       
              observation   0.2546  0.5046  -0.56
 Residual                 109.6194 10.4699       
Number of obs: 44622, groups:  nameLocation, 1750

Fixed effects:
            Estimate Std. Error t value
(Intercept) 12.42496    0.18418   67.46
observation  0.63182    0.01486   42.51

Correlation of Fixed Effects:
            (Intr)
observation -0.648
convergence code: 0
Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?

Nothing changes with regard to our fixed effects, but we get some added information in our random effects. The random intercept variance for each location is telling us the amount that the first score bounces around from place to place (a pretty massive amount, if you ask me) and the observation variance is telling us how much variability there is within the slope between locations.

If we use the predicted values of our model, we can see what our results will look like over the observations:

randEffPlot = reviewedRest %>% 
  ungroup() %>% 
  mutate(mixedPrediction = predict(observationMod)) %>% 
  group_by(nameLocation) %>% 
  ggplot(., aes(observation, mixedPrediction, group = nameLocation)) + 
  geom_line(alpha = .5) + 
  theme_minimal()
  
randEffPlot

Amazing! We can really see the varying intercepts for each restaurant and we can see how the slopes differ (some are similar, but we have a completely mixed bag of directions and magnitudes).

For the sake of demonstration, and hopefully as a way to bring everything together, let’s see what predictions from a standard regression might look like compared to our mixed model:

slimPred = predict(lm(SCORE ~ observation, data = reviewedRest))

reviewedRest = reviewedRest %>% 
  ungroup() %>% 
  mutate(slimPrediction = slimPred)

randEffPlot +
  geom_line(data = reviewedRest, aes(observation, slimPrediction), 
            color = "#ff5500", size = 2)

It looks like the standard regression line would do okay for many of our locations, but we can see that it would do poorly with many others. This should help to provide an illustration of just how flexible and powerful mixed models can be in your hands.

Missing Data

Data generated by people has missing values – this is an unfortunate truth. While shying away from the truth might be someone’s idea of a good time, missing data provides us with some interesting issues to tackle.

Why Is Missing Data Important

The vast majority of statistical techniques are not robust to missingness. Take our standard linear model (lm):

library(dplyr)
library(magrittr)

testData = read.csv("missingSurvey.csv")

testData %>% 
  lm(average_weekly_hours ~ EMP_Engagement_5, data = .) %>% 
  summary()

Call:
lm(formula = average_weekly_hours ~ EMP_Engagement_5, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.4551  -5.5699   0.2005   5.3153  31.4301 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       17.9143     0.2439   73.46   <2e-16 ***
EMP_Engagement_5   2.8852     0.0913   31.60   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.031 on 13743 degrees of freedom
  (1254 observations deleted due to missingness)
Multiple R-squared:  0.06775,   Adjusted R-squared:  0.06768 
F-statistic: 998.7 on 1 and 13743 DF,  p-value: < 2.2e-16

Do you see any discrepancies? If we look at the number of rows within our data, we see a value of 14999 – a very decent sample for sure. If, however, we take a look at the degrees of freedom from our model output, we see that a considerable number of rows were dropped (the output even gives us a message letting us know how many were dropped). A linear regression cannot use missing information, so the whole row gets dropped if any missingness exists within the variables being used. If an observation or a dozen get dropped out of a few hundred rows, we might not have too much to worry about. What if more than half of the observations get dropped? It’s probably not a good thing.
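To see this listwise deletion on a small scale, consider a toy data set (entirely made up for illustration):

```r
# A tiny made-up data set with a single missing predictor value
toyData = data.frame(y = c(2, 4, 7, 8), 
                     x = c(1, 2, NA, 4))

toyFit = lm(y ~ x, data = toyData)

# lm() silently drops the entire row containing the NA
nobs(toyFit)  # 3, not 4
```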

Types Of Missingness

When we talk about missing data, it is easy to assume that it is just missing: nothing more and nothing less. What if we knew why those values were missing? Would that change your thoughts on missingness? What if you knew that certain people did not respond to certain questions (for example, people who do not identify with any particular religion may leave a question about religious strength blank)? Just thinking about such an example, would grouping that kind of missingness with someone who just skipped a question purely through omission make sense? You would likely say that they are different, and you would be correct. Generally, we can think of three different types of missing values: missing completely at random, missing at random, and missing not at random.

Data missing completely at random (MCAR), is exactly how it sounds – the missing data is not related to any study variables and it is not related to any unknown process or parameter about the person generating the data. We can assume that MCAR data is not biased.

Data missing at random (MAR) is a bit different and a little trickier. MAR data, despite the name, is not really missing at random; instead, the missingness can be fully explained by other observed variables, not by the missing values themselves.

Missing not at random (MNAR) goes one step further: the missingness depends on the unobserved values themselves (e.g., the people most likely to skip a question are those whose answers would have been extreme).
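The three mechanisms can be sketched with a quick simulation (the variables and numbers here are invented purely for illustration):

```r
set.seed(1001)

n = 1000
income = rnorm(n, mean = 50, sd = 10)    # always observed
stress = .5 * income + rnorm(n, sd = 5)  # the variable that goes missing

# MCAR: missingness is a pure coin flip, unrelated to anything
stressMCAR = ifelse(runif(n) < .2, NA, stress)

# MAR: missingness depends only on the observed variable (income)
stressMAR = ifelse(income > 60 & runif(n) < .5, NA, stress)

# MNAR: missingness depends on the unobserved value itself
stressMNAR = ifelse(stress > 30 & runif(n) < .5, NA, stress)
```

Dropping rows is harmless (beyond lost power) under MCAR, recoverable with the right covariates under MAR, and biasing under MNAR.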

Determining Missingness Types

Determining which type of missingness is present within your data can be tricky. In the most ideal world, we would have the data to know whether it is MCAR or something else. I think you can see the circular nature that we are getting ourselves into!

There are, however, some methods that will let us test our missingness. The MissMech package will perform some tests that can help.

library(MissMech)

library(dplyr)

mcarTest = TestMCARNormality(testData)

mcarTest
Call:
TestMCARNormality(data = testData)

Number of Patterns:  7 

Total number of cases used in the analysis:  14999 

 Pattern(s) used:
          EMP_Engagement_1   EMP_Engagement_2   EMP_Engagement_3
group.1                  1                  1                  1
group.2                  1                  1                  1
group.3                  1                  1                  1
group.4                  1                 NA                  1
group.5                  1                  1                  1
group.6                 NA                  1                  1
group.7                  1                  1                 NA
          EMP_Engagement_4   EMP_Engagement_5   average_weekly_hours
group.1                  1                  1                     NA
group.2                  1                  1                      1
group.3                 NA                  1                      1
group.4                  1                  1                      1
group.5                  1                 NA                      1
group.6                  1                  1                      1
group.7                  1                  1                      1
          Number of cases
group.1               618
group.2             11214
group.3               635
group.4               658
group.5               636
group.6               618
group.7               620

    Test of normality and Homoscedasticity:
  -------------------------------------------

Hawkins Test:

    P-value for the Hawkins test of normality and homoscedasticity:  7.09661e-304 

    Either the test of multivariate normality or homoscedasticity (or both) is rejected.
    Provided that normality can be assumed, the hypothesis of MCAR is 
    rejected at 0.05 significance level. 

Non-Parametric Test:

    P-value for the non-parametric test of homoscedasticity:  1.413518e-08 

    Hypothesis of MCAR is rejected at  0.05 significance level.
    The multivariate normality test is inconclusive. 

In our output, we get a glimpse at what missing patterns exist. We can see that there are 11214 complete cases (group.2 has a 1 for every variable). If we look at the Hawkins test for multivariate normality and homoscedasticity, we have to reject multivariate normality and/or homoscedasticity (i.e., we are rejecting the hypothesis that our data is normal and homoscedastic – not surprising, since this is Likert-flavored survey data). Since multivariate normality is out of the picture, we cannot rely on the Hawkins test alone (if we could, it would tell us to reject the MCAR assumption), so we move down to the non-parametric test. There, the tiny p-value again leads us to reject the hypothesis of MCAR – our data does not appear to be missing completely at random.

Had our results been different, we would have gotten a message stating, “There is not sufficient evidence to reject normality or MCAR at 0.05 significance level.”.

The language here is important – notice that there was very little in the way of absolutes because we can never be 100% confident that our data is missing completely at random.

The TestMCARNormality function will only take numeric variables, so be sure to keep that in mind when using it.
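Since TestMCARNormality expects an all-numeric data frame, a quick base R sketch (with a made-up data frame) for dropping non-numeric columns first might look like this:

```r
# Hypothetical data frame mixing numeric and character columns.
df = data.frame(hours = c(40, NA, 38),
                engagement = c(3, 4, NA),
                department = c("a", "b", "a"),
                stringsAsFactors = FALSE)

# Keep only the numeric columns before passing data along.
numericOnly = df[, sapply(df, is.numeric), drop = FALSE]

names(numericOnly)
```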

If we can establish MCAR, then we can proceed with some imputation without fear of introducing a terrible amount of bias. If we cannot establish that our missing data was missing completely at random, then our multiply imputed data might have some issues.

Imputation Methods

Many types of simple imputation exist: mean imputation and median imputation, among others. These central tendency-based imputation methods are classical holdovers. In essence, they replace a missing value with the mean (or median) of the column.

Here is a brief demonstration of how this might work:

testData %>%
  mutate(average_weekly_hours = ifelse(is.na(average_weekly_hours) == TRUE, 
                                       mean(average_weekly_hours, na.rm = TRUE), 
                                       average_weekly_hours), 
         EMP_Engagement_5 = ifelse(is.na(EMP_Engagement_5), 
                                   mean(EMP_Engagement_5, na.rm = TRUE), 
                                   EMP_Engagement_5)) %>% 
  lm(average_weekly_hours ~ EMP_Engagement_5, data = .) %>% 
  summary()

Call:
lm(formula = average_weekly_hours ~ EMP_Engagement_5, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.283  -5.516  -0.378   5.251  33.622 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       18.2144     0.2344   77.69   <2e-16 ***
EMP_Engagement_5   2.7672     0.0879   31.48   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.912 on 14997 degrees of freedom
Multiple R-squared:  0.06199,   Adjusted R-squared:  0.06193 
F-statistic: 991.1 on 1 and 14997 DF,  p-value: < 2.2e-16

Our missing values are fixed, so all is well. In reality, we are making a pretty big logical leap in assuming that those missing values would be the mean. It might not be the worst assumption that we could make (if we knew that this was the actual population average, it might be slightly more reasonable), but it is still tenuous.

When statistics were done on punch cards, this was probably about the best that could be done. With greater computing power and more advanced statistical methods, we can get much better imputation values.

Multiple Imputation

mice

Multiple imputation takes those simple imputation techniques much further. With multivariate imputation by chained equations (mice), we go through an iterative process to estimate the missing values and then use several imputed datasets for our models. mice cycles through the variables, using the other variables in the data to predict each missing value.

We will specify the “cart” method (classification and regression trees) for our engagement questions and predictive mean matching (“pmm”) for our hours worked. There are many options depending upon your exact needs (i.e., data types), so spend some time looking at the built-in imputation methods in mice. The “cart” and “pmm” methods were used here purely for flexibility.

library(mice)

imputedData = mice(testData, m = 10, maxit = 20, 
                   method = c("cart", "cart", "cart", 
                              "cart", "cart", "pmm"), 
                   pred = quickpred(testData, minpuc = .2, mincor = .01), 
                   print = FALSE)

imputedData
Class: mids
Number of multiple imputations:  10 
Imputation methods:
    EMP_Engagement_1     EMP_Engagement_2     EMP_Engagement_3 
              "cart"               "cart"               "cart" 
    EMP_Engagement_4     EMP_Engagement_5 average_weekly_hours 
              "cart"               "cart"                "pmm" 
PredictorMatrix:
                     EMP_Engagement_1 EMP_Engagement_2 EMP_Engagement_3
EMP_Engagement_1                    0                0                0
EMP_Engagement_2                    0                0                0
EMP_Engagement_3                    0                0                0
EMP_Engagement_4                    1                1                1
EMP_Engagement_5                    1                1                0
average_weekly_hours                1                0                1
                     EMP_Engagement_4 EMP_Engagement_5
EMP_Engagement_1                    0                0
EMP_Engagement_2                    1                1
EMP_Engagement_3                    1                0
EMP_Engagement_4                    0                0
EMP_Engagement_5                    0                0
average_weekly_hours                0                1
                     average_weekly_hours
EMP_Engagement_1                        0
EMP_Engagement_2                        1
EMP_Engagement_3                        1
EMP_Engagement_4                        1
EMP_Engagement_5                        1
average_weekly_hours                    0
plot(imputedData, layout = c(2, 1))

Will you look at that plot! These are trace lines; they track each imputation model over each iteration. What do we take from them? If they look like fuzzy caterpillars, the chained-equation process mixed well. If there is any discernible pattern, then something odd might have happened – these look just as crazy as they should! The mean for our imputed variables generally bounced around between 2.45 and 2.65. Given the observed mean, this seems pretty good.
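Beyond trace lines, it can help to compare the distributions of observed and imputed values. Here is a self-contained sketch on simulated stand-in data (not our employee data), assuming the mice and lattice packages are installed; densityplot() overlays the imputed densities on the observed one:

```r
library(mice)
library(lattice)

set.seed(1001)

# Simulated stand-in data with values knocked out completely at random.
demo = data.frame(x = rnorm(200), y = rnorm(200))
demo$y[sample(200, 40)] = NA

demoImp = mice(demo, m = 5, maxit = 5, method = "pmm", print = FALSE)

# Observed and imputed densities should overlap reasonably well
# when the imputations are plausible.
densityplot(demoImp)
```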

On Markov Chains

The chained equations that we are dealing with in mice are very much akin to Markov Chain Monte Carlo methods.

Here is a conceptual example of Markov chains! This example is adapted from Richard McElreath’s excellent book, Statistical Rethinking.

You manage 10 teams, named 1 through 10. Each team’s name equals the number of people on that team, and each team sits on its own floor of the building.

You need to visit a team every day, in proportion to the number of people on the team (i.e., you would visit team 10 more often than team 1).

At the end of the day, you randomly select whether a proposed move will take you up a floor or down a floor.

After randomly selecting up or down, you grab a number of pens that is equal to the number of the current team (for team 10, you would grab 10 pens).

Next, you grab a number of pencils equal to the name of the proposed team (your randomly selected proposal takes you up a floor from team 10, so you start back at the bottom with team 1). So you would have 10 pens and 1 pencil.

If you have more pencils than pens, you will always move to the proposed floor.

If you have more pens than pencils, you set aside a number of pens equal to the number of pencils, then put the remaining pens, together with the pencils, in a drawer.

You reach into the drawer to randomly select a pen or a pencil.

Your selection decides where you go – pen is stay and pencil is go!

You should play with the following function (uncomment the line, play with starting values, etc.):

markovSim = function(daysRun, startDay) {
  
  position = rep(0, daysRun)
  
  current = startDay
  
  for(i in 1:daysRun) {
    position[i] = current
    
    proposal = current + sample(c(-1, 1), size = 1)
    
    if(proposal < 1) proposal = 10
    
    if(proposal > 10) proposal = 1
    
    probabilityMove = proposal / current
    
    current = ifelse(runif(1) < probabilityMove, proposal, current)
    
    # print(paste(position[i], proposal, probabilityMove, current, sep = " -> "))
  }
  
  return(position)
}

test1 = markovSim(1000, 5)

test2 = markovSim(1000, 6)

library(ggplot2)

ggplot() + 
  geom_line(data = as.data.frame(test1), aes(1:length(test1), test1), 
            color = "#ff5500", size = .75) +
  geom_line(data = as.data.frame(test2), aes(1:length(test2), test2), 
            color = "black", size = .75) +
  theme_minimal()

Does that plot look at all familiar? It sure looks like a Pyrrharctia isabella larva to me. Beyond its similarity to a woolly bear caterpillar, this is a perfect illustration of a Markov chain.

As fun as that was, let’s get back to the matter at hand.

Now, we have 10 imputed data sets that have been through 20 iterations that we can perform separate analyses on.

fit1 = with(data = imputedData, 
            exp = lm(average_weekly_hours ~ EMP_Engagement_5))

We can check on each individual imputed data set:

summary(fit1)
# A tibble: 20 x 5
   term             estimate std.error statistic   p.value
   <chr>               <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)         17.9     0.235       76.4 0.       
 2 EMP_Engagement_5     2.89    0.0878      32.9 1.23e-228
 3 (Intercept)         17.9     0.234       76.4 0.       
 4 EMP_Engagement_5     2.89    0.0877      32.9 7.62e-230
 5 (Intercept)         17.9     0.234       76.2 0.       
 6 EMP_Engagement_5     2.91    0.0877      33.1 3.57e-232
 7 (Intercept)         18.0     0.234       77.0 0.       
 8 EMP_Engagement_5     2.85    0.0875      32.6 1.24e-224
 9 (Intercept)         17.8     0.234       76.0 0.       
10 EMP_Engagement_5     2.91    0.0876      33.3 5.52e-234
11 (Intercept)         17.9     0.234       76.2 0.       
12 EMP_Engagement_5     2.91    0.0878      33.1 5.27e-232
13 (Intercept)         17.8     0.234       76.0 0.       
14 EMP_Engagement_5     2.93    0.0876      33.5 5.97e-237
15 (Intercept)         17.7     0.234       75.8 0.       
16 EMP_Engagement_5     2.95    0.0875      33.7 1.02e-239
17 (Intercept)         17.9     0.234       76.5 0.       
18 EMP_Engagement_5     2.91    0.0875      33.3 3.63e-234
19 (Intercept)         17.8     0.233       76.2 0.       
20 EMP_Engagement_5     2.94    0.0874      33.7 5.26e-240

All of our coefficients are hovering around or slightly above our original coefficient, but our standard errors are uniformly smaller.

And we can then pool those results together:

pooledFit = pool(fit1)

summary(pooledFit)
                  estimate  std.error statistic       df p.value
(Intercept)      17.847540 0.25054709  71.23427 531.0723       0
EMP_Engagement_5  2.908382 0.09281962  31.33370 718.7926       0

How different are these pooled coefficients compared to our previous coefficients? The pooled are certainly a bit stronger, but not terribly so.
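For intuition about what pool() is doing, Rubin’s rules are simple enough to sketch by hand. The numbers below are made up for illustration: the pooled estimate is the mean of the per-imputation estimates, and the total variance adds the between-imputation variance (inflated by 1/m) to the average within-imputation variance – which is why a pooled standard error sits a bit above any single imputed dataset’s:

```r
# Hypothetical results from m = 5 imputed datasets.
estimates = c(2.89, 2.89, 2.91, 2.85, 2.91)
standardErrors = c(0.0878, 0.0877, 0.0877, 0.0875, 0.0876)

m = length(estimates)

pooledEstimate = mean(estimates)

withinVar = mean(standardErrors^2)        # average sampling variance
betweenVar = var(estimates)               # variance across imputations
totalVar = withinVar + (1 + 1/m) * betweenVar

pooledSE = sqrt(totalVar)

c(estimate = pooledEstimate, se = pooledSE)
```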

Missing Or Sparse

On occasion, you will see data that looks like it has a lot of missingness. It might not truly be missing; it might just be sparse. Missing means that something was skipped (think about a survey question) and you have no idea what the value actually is. In sparse data, the observations have a known value of zero.

Sparsity comes up in a great many places, but the following example will hopefully help to make the point clearer. I shop on Amazon, but I have not bought everything there is to buy on Amazon. Not only have I not bought everything, I have not even seen every product on Amazon. If there were to be a matrix of every item that I have bought off of Amazon (all of those products get a 1) and everything I have not bought off Amazon (all of those products would get a 0), we would have very sparse data. In other words, Amazon actually knows what I have and have not bought.

On the other hand, let’s say that I just got a middle of the road Red Dragon keyboard RGB mechanical keyboard. Amazon knows that I bought it, so I would get a 1 in that column. However, I have not rated said keyboard; therefore, Amazon has no idea how much I actually like the keyboard. Essentially, there is a blank in the ratings column that cannot be assumed to be 0.
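A toy sketch of the distinction (all item names hypothetical): a zero in sparse data is real information, whereas an NA is the absence of information:

```r
# Purchases: Amazon knows what was and was not bought, so zeroes are real.
purchases = c(keyboard = 1, mouse = 0, monitor = 0, cable = 1)

# Ratings: an unrated item is unknown, not zero.
ratings = c(keyboard = NA, mouse = NA, monitor = NA, cable = 5)

sum(purchases == 0)           # items we know were not bought
sum(is.na(ratings))           # items whose rating we simply do not know

mean(purchases)               # zeroes enter the computation
mean(ratings)                 # NA: R will not pretend it knows
mean(ratings, na.rm = TRUE)   # the mean of what we actually observed
```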

Cluster Analysis

There is no shortage of ways to cluster data, and the reasons to do it are just as numerous. Clustering is great for reducing the number of dimensions within your data (i.e., combining several variables into one), but it is equally well suited to finding natural groups within data.

There are several large classes of cluster analyses, such as centroid or hierarchical clustering. Two of the bigger divisions are hard clustering and fuzzy clustering. Hard clustering algorithms assign each observation to a single cluster, whereas fuzzy clustering algorithms are probabilistic in nature (i.e., an observation has a certain probability of belonging to any one cluster).

Centroid Clustering

Centroid clustering is based upon distances around some centroid (maybe a mean). Many centroid models tend to produce hard clustering by nature.

Careful attention needs to be paid to your data when clustering. Are your variables ones for which a mean is meaningful? If so, k-means clustering might work well.

Determining Cluster Numbers

Determining the number of clusters is very much akin to determining the number of factors in factor analysis – theory can guide the practice, but various statistics can help us make our decisions too. When clustering, especially hard clustering, we need to be thoughtful about the clarity of our clusters – too few clusters might not be helpful, but too many clusters might not separate very well. There are several ways to test for an appropriate number of clusters (scree plots, silhouette widths, the gap statistic); here, we will use the NbClust package, which aggregates many such indices to propose a value for k.

As an example, let’s use the data from the Holzinger Swineford article, found in the lavaan package.

library(factoextra)

library(lavaan)

library(dplyr)

testData = HolzingerSwineford1939 %>% 
  dplyr::select(starts_with("x"))

kTest = NbClust::NbClust(testData, method = "kmeans")

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 8 proposed 2 as the best number of clusters 
* 4 proposed 3 as the best number of clusters 
* 2 proposed 4 as the best number of clusters 
* 1 proposed 6 as the best number of clusters 
* 5 proposed 8 as the best number of clusters 
* 1 proposed 13 as the best number of clusters 
* 1 proposed 14 as the best number of clusters 
* 1 proposed 15 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
******************************************************************* 

We are going to get a lot of information from the NbClust function, but a majority rule for number of clusters is the most helpful.
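If you would rather lean on a single criterion, the gap statistic can be computed directly with clusGap from the cluster package. This sketch uses the built-in USArrests data as a stand-in for our own (the B and nstart values are arbitrary choices for illustration):

```r
library(cluster)

set.seed(1001)

# Gap statistic for k = 1 through 8, with 50 bootstrap reference datasets.
gapTest = clusGap(scale(USArrests), FUNcluster = kmeans,
                  K.max = 8, B = 50, nstart = 20)

# maxSE applies the "first value within one SE of the max" rule
# to pick a number of clusters.
maxSE(gapTest$Tab[, "gap"], gapTest$Tab[, "SE.sim"])
```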

With our proposed number of clusters, let’s turn to some models.

k-means

Let’s run a k-means model with our test data.

kmeansTest = kmeans(testData, 2)

kmeansTest
K-means clustering with 2 clusters of sizes 129, 172

Cluster means:
        x1       x2       x3       x4       x5       x6       x7       x8
1 5.683463 6.565891 2.813953 3.894057 5.259690 2.973422 4.320863 5.743023
2 4.375000 5.729651 1.827762 2.436047 3.651163 1.594684 4.084681 5.365116
        x9
1 5.785745
2 5.065407

Clustering vector:
  [1] 2 2 2 1 2 2 2 2 1 2 2 1 1 2 1 2 2 2 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2
 [36] 2 2 2 1 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 1 1 2 2 2
 [71] 2 2 2 1 2 2 1 2 1 2 2 1 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2 1 2 1 2 1 1 2 1
[106] 1 1 1 1 2 1 1 1 1 2 2 1 2 2 2 2 2 1 2 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1
[141] 2 2 1 1 2 2 1 2 1 2 1 1 2 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 1 2 1 1 2
[176] 1 2 1 1 1 2 1 1 2 2 2 2 2 2 2 2 1 1 2 2 1 1 1 2 2 1 1 2 2 2 1 2 2 2 2
[211] 1 1 1 1 2 2 1 2 2 2 2 2 1 2 1 1 2 2 2 1 2 2 1 2 2 2 1 1 1 1 1 1 1 1 2
[246] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 2 1 1 2 2 2 2 2 1 2 1
[281] 2 1 2 1 2 2 1 2 2 1 2 2 2 1 2 1 2 2 1 2 1

Within cluster sum of squares by cluster:
[1] 1169.721 1484.421
 (between_SS / total_SS =  22.9 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault"      

Our output provides some very useful information. First, we see the cluster means for each item across the two clusters. These give us a pretty good idea of the “location” of each cluster and help us start to get an idea of what the clusters look like (we essentially have one higher-scoring group and one lower-scoring group).

Next, we see our clustering vector. This tells us to which cluster each observation was assigned.

We can also take a look at the bivariate plots, which will show cluster membership within each pair of variables.

plot(testData, col = kmeansTest$cluster)

We can see the variables where our clusters separate pretty nicely (e.g., x1 and x4) and where they do not (x8 and x9).

And a plot based upon the first two components of a PCA:

cluster::clusplot(testData, kmeansTest$cluster)

k-means clustering is great and widely used. But it can be picky about data (it really likes data without extreme values), and it can produce different results depending on the initial cluster assignments (so results can vary from run to run).

medoids

To circumvent these issues with k-means, we can partition around medoids. Whereas a centroid can be an arbitrary point with observations clustered around it, a medoid is an actual observation within the data that the other observations cluster around. In other words, a centroid typically does not correspond to any real observation, but a medoid always does.

library(cluster)

pamTest = cluster::pam(testData, k = 2)

pamTest
Medoids:
      ID       x1   x2  x3       x4   x5       x6       x7   x8       x9
[1,] 162 5.000000 6.25 2.5 3.333333 5.75 2.571429 4.086957 5.65 5.583333
[2,]  11 3.666667 5.75 2.0 2.000000 3.50 1.571429 3.739130 5.70 4.305556
Clustering vector:
  [1] 1 2 2 1 2 2 1 2 1 2 2 1 1 1 1 2 2 2 1 1 2 1 1 2 2 2 2 2 1 2 2 2 1 1 2
 [36] 2 1 2 1 1 2 2 2 2 1 1 2 2 2 1 2 1 2 2 1 1 2 1 2 2 1 1 2 1 2 1 1 2 2 1
 [71] 2 2 2 1 2 2 1 2 1 2 2 1 2 2 2 1 1 1 2 1 2 2 2 2 1 1 1 1 2 1 1 1 1 1 1
[106] 1 1 1 1 2 1 1 1 1 2 1 1 2 2 2 1 2 1 2 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1
[141] 1 2 1 1 2 1 1 2 1 1 1 1 1 2 2 1 2 1 1 2 1 1 1 1 1 1 2 2 1 2 1 2 1 1 1
[176] 1 2 1 1 1 1 1 1 2 2 1 2 1 2 2 2 1 1 2 2 1 1 1 2 2 1 1 2 1 1 1 2 2 2 2
[211] 1 1 1 1 2 1 1 2 1 2 2 2 1 2 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1
[246] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 1 1 2 1
[281] 2 1 2 1 2 1 1 2 1 1 1 2 1 1 1 1 2 2 1 1 1
Objective function:
   build     swap 
3.159638 2.999334 

Available components:
 [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
 [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"      

We see output very similar to what we got from k-means, but now we get the actual observations that serve as the medoids. This identification helps us get a better idea of what the clusters look like.

cluster::clusplot(testData, pamTest$clustering)

If we look at both plots, we can notice some definite differences in cluster assignment as we get toward the boundary between the groups (pam puts some observations in one cluster, while k-means puts the same observations into a different cluster). This is purely a function of how the two methods define the cluster center.

par(mfrow = c(1, 2))

cluster::clusplot(testData, pamTest$clustering, main = "PAM")

cluster::clusplot(testData, kmeansTest$cluster, main = ("KMeans"))

Distribution Clustering

Centroid-based clustering algorithms work great when our data looks pretty normal. Data that is not normally distributed can cause problems in finding a sensible centroid.

Imagine a variable distributed in the following manner:

It looks like it has something close to a normal distribution on the right and something a bit lumpy on the left. A centroid-based algorithm would still find a centroid, but we could almost imagine that this distribution is a mix of two different distributions – any single centroid is going to betray the actual distribution of this data. Situations like this call for distribution clustering.

Sometimes referred to as model-based clustering, distribution clustering allows us to model variables that have mixed distributions (maybe a mixture of normal distributions or a normal distribution and something like the bimodal-looking distribution we just saw).

If your data have a mixture of normals or you just want the actual distribution of your data to drive the clustering, then these distribution clustering algorithms will offer better solutions. Furthermore, they are better suited for clustering items that might have a latent structure.

Given that we are using our data’s distributions, we need to do another check on the number of clusters. Not only will we get a good number of clusters, but we will also get the clustering shapes that best fit our data.

library(mclust)

bicTest = mclustBIC(testData)

bicTest
Bayesian Information Criterion (BIC): 
        EII       VII       EEI       VEI       EVI       VVI       EEE
1 -8395.208 -8395.208 -8411.764 -8411.764 -8411.764 -8411.764 -7698.368
2 -8091.098 -8096.223 -8098.389 -8103.418 -8087.308 -8092.388 -7684.869
3 -8043.834 -8026.535 -7946.965 -7948.718 -7978.589 -7981.702 -7720.952
4 -7997.530 -7996.730 -7925.971 -7930.479 -7995.874 -8017.191 -7757.011
5 -7974.597 -7980.670 -7896.040 -7898.856 -7998.222 -8002.864 -7766.243
6 -7967.265 -7976.579 -7914.053 -7913.319 -8024.885 -8023.845 -7814.074
7 -7966.191 -7972.153 -7910.080 -7928.153 -8071.526 -8078.025 -7816.869
8 -7974.345 -8002.181 -7926.699 -7948.928 -8109.304 -8123.281 -7858.328
9 -8002.500 -8035.987 -7960.374 -7972.648 -8150.685 -8173.271 -7889.681
        EVE       VEE       VVE       EEV       VEV       EVV       VVV
1 -7698.368 -7698.368 -7698.368 -7698.368 -7698.368 -7698.368 -7698.368
2 -7705.746 -7683.987 -7703.536 -7808.094 -7816.565 -7843.921 -7833.820
3 -7744.304 -7715.837 -7759.226 -7956.501 -7973.278 -8059.269 -8045.041
4 -7812.783 -7746.291 -7808.855 -8125.923 -8115.375 -8226.645 -8225.402
5 -7864.346 -7781.799 -7879.806 -8277.594 -8292.735 -8402.984 -8457.127
6 -7952.675 -7818.828 -7945.551 -8411.152 -8456.776 -8613.663 -8680.916
7 -7980.994 -7829.336 -7936.425 -8586.715 -8612.927 -8804.420 -8834.708
8 -8023.371 -7866.886 -8058.225 -8789.471 -8814.082 -9049.331 -9086.450
9 -8094.573 -7906.292 -8111.024 -8969.355 -9024.142 -9281.446 -9344.505

Top 3 models based on the BIC criterion: 
    VEE,2     EEE,2     EEE,1 
-7683.987 -7684.869 -7698.368 

In looking at the bottom of the output (“Top 3 models based on the BIC criterion”), we see that a two-cluster solution might work well, given that those are the first two solutions offered. The best solution, however, is “VEE, 2” – what does VEE mean? It means that our mixture is ellipsoidal, with equal shape and orientation across clusters but varying volume. Check the mclustModelNames function for all of the names.

Let’s take a peek at it:

mclustTest = Mclust(testData, 2, modelNames = "VEE")

summary(mclustTest)
---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust VEE (ellipsoidal, equal shape and orientation) model with 2
components: 

 log.likelihood   n df       BIC       ICL
      -3656.513 301 65 -7683.987 -7715.366

Clustering table:
  1   2 
205  96 

We can see that the provided output is very different from our previous clustering efforts. We have a clustering table, demonstrating how many observations are grouped into the two clusters. We also have mixing probabilities.

mclustTest$parameters$pro
[1] 0.6638782 0.3361218

The mixing probabilities add up to 1 and are simply the probabilities that an observation winds up in one cluster or the other.

Now, let’s look at our clustering centers:

mclustTest$parameters$mean
       [,1]     [,2]
x1 4.742245 5.318003
x2 5.768518 6.719131
x3 1.599823 3.535407
x4 2.941652 3.296451
x5 4.294729 4.430997
x6 2.066251 2.421243
x7 4.155177 4.246587
x8 5.435710 5.707536
x9 5.289581 5.541104

This is giving us an idea about the center for each of the clusters.

plot(mclustTest, what = "classification")

We can also look at the components graph (like we saw earlier):

coordProj(testData, parameters = mclustTest$parameters, z = mclustTest$z)

Hierarchical Clustering

Everything we have seen to this point has needed some input about how many clusters to find.

Conversely, hierarchical clustering will cluster without any such input. The agglomerative approach starts with every observation as its own cluster and then begins connecting those individual clusters into larger clusters; this continues until every cluster sits under one larger cluster. (The reverse, where everything starts in one big cluster that is broken into smaller clusters, is known as divisive clustering.)

hierarchicalTest = hclust(dist(testData))
plot(hierarchicalTest)

This monstrous-looking dendrogram shows us how the observations cluster all the way up the tree. The y-axis, Height, tells us how similar observations are when they get clustered together – i.e., observations that cluster at height 1 are much more similar than observations that cluster at height 6.

If we want to get an idea about cluster membership, we need to determine where to cut our tree. For consistency with our other clustering algorithms, we can just go with defining two clusters:

hier2Clus = cutree(hierarchicalTest, k = 2)

This provides the cluster membership so that we can identify the observations in our data that belong to a cluster:

head(testData[hier2Clus == 2, ])
         x1   x2    x3       x4   x5       x6       x7   x8       x9
15 5.833333 5.75 3.625 5.000000 5.50 3.000000 4.000000 4.35 5.861111
19 5.666667 5.25 4.000 4.333333 5.25 3.714286 3.913044 4.85 5.750000
20 6.333333 8.75 3.000 3.666667 3.75 2.571429 3.478261 5.35 4.916667
67 4.000000 9.25 4.000 3.000000 3.75 1.000000 3.608696 5.75 6.194444
74 5.000000 6.00 3.250 3.333333 5.75 2.714286 3.478261 4.60 4.277778
82 4.666667 8.00 4.250 3.000000 4.75 2.428571 5.130435 5.75 5.305556

Finally, we could get the averages for each variable for both clusters:

rbind(cluster1 = testData[hier2Clus == 1, ] %>% 
        summarize_all(mean), 
      cluster2 = testData[hier2Clus == 2, ] %>% 
        summarize_all(mean))
               x1       x2       x3       x4       x5       x6       x7
cluster1 4.722452 5.844008 2.042872 2.749311 4.047521 1.864227 4.120014
cluster2 5.810734 7.088983 3.101695 4.338983 5.542373 3.503632 4.456153
               x8       x9
cluster1 5.421694 5.234848
cluster2 5.959322 5.945386

Just like everything else we have seen, we have a lower scoring group and a higher scoring group.

A Brief Word On Fuzzy Clustering

To this point, we have been using “hard” clustering – an observation is put into one cluster, with no notion that it could belong to another. Fuzzy clustering changes this by providing probabilities of cluster membership.

fuzzyTest = fanny(testData, k = 2, memb.exp = 1.25)

We can get our clusters, just like we did with everything else:

head(fuzzyTest$clustering)
[1] 1 1 1 2 1 1

But a key difference is that we can also get the membership coefficients:

head(fuzzyTest$membership)
          [,1]       [,2]
[1,] 0.5830297 0.41697030
[2,] 0.7306898 0.26931022
[3,] 0.9219575 0.07804253
[4,] 0.3316131 0.66838689
[5,] 0.7620418 0.23795817
[6,] 0.7734978 0.22650221

These can essentially be thought of as the probability of belonging to a cluster.
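Because the membership coefficients behave like probabilities, each observation’s row sums to 1. A quick self-contained check on stand-in data (the built-in iris measurements, with the same memb.exp used above):

```r
library(cluster)

# The iris measurements stand in for our data here.
fuzzyDemo = fanny(iris[, 1:4], k = 2, memb.exp = 1.25)

# Every row of the membership matrix is a probability distribution
# over the two clusters.
range(rowSums(fuzzyDemo$membership))
```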

fviz_cluster(fuzzyTest)

We can see that a great many of the observations are pretty clear-cut, but we can look at one of the observations resting between the two clusters.

fuzzyTest$membership[67, ]
[1] 0.4565806 0.5434194

We can see that it is not exactly 50/50, but it is pretty clear that this observation’s membership in cluster 2 is not 100% concrete.

Latent Class Analysis

Can There Really Be More Latent Stuff?

We have covered a big slice of models that deal with some type of latent structure. Latent class analysis (LCA) differs from what we have previously seen. In factor analysis, for example, we are finding the latent variables that give rise to the measured items. LCA, on the other hand, attempts to use measured variables to find latent classes of people. For example, if we have variables regarding race, gender, education, and employment status, can we find naturally occurring groups that cluster together?

LCA

LCA has definite conceptual relationships to cluster analysis and factor analysis. From a technical standpoint, however, they are different. While we saw that certain variable types work better for different types of cluster analyses, categorical variables are the only variables that can be used in LCA (e.g., it could not use raw age – you would unfortunately need to discretize it first). The latent classes that emerge in LCA are often described as “profiles” (some people even call it latent profile analysis) – thinking of them as profiles lets you imagine constructing general profiles and grouping people into them.
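If you do have a numeric variable like raw age, you would need to bin it into categories first. A small sketch with hypothetical ages and arbitrary break points (poLCA also wants its categorical variables coded as positive integers, which as.numeric() on a factor provides):

```r
# Hypothetical ages, discretized into arbitrary bins for LCA.
age = c(19, 24, 33, 41, 52, 67)

ageGroup = cut(age,
               breaks = c(0, 25, 45, 65, Inf),
               labels = c("18-25", "26-45", "46-65", "65+"))

table(ageGroup)

# poLCA expects integer codes starting at 1.
as.numeric(ageGroup)
```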

Let’s try to fit a model with 2 latent classes based upon education, gender, and party identification. We can use the election data from poLCA.

library(poLCA)

library(dplyr)

data("election")

lcaDat = election

lcaFormula = cbind(EDUC, GENDER, PARTY) ~ 1

lcaMod2 = poLCA(lcaFormula, lcaDat, 2, maxiter = 5000)
Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$EDUC
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.0445 0.0838 0.3899 0.2047 0.0995 0.1302 0.0475
class 2:  0.0182 0.0272 0.1472 0.2196 0.0863 0.3223 0.1793

$GENDER
           Pr(1)  Pr(2)
class 1:  0.3374 0.6626
class 2:  0.5809 0.4191

$PARTY
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.2304 0.2139 0.1624 0.1673 0.0888 0.0768 0.0604
class 2:  0.1472 0.0715 0.1362 0.0357 0.1895 0.1832 0.2367

Estimated class population shares 
 0.5863 0.4137 
 
Predicted class memberships (by modal posterior prob.) 
 0.608 0.392 
 
========================================================= 
Fit for 2 latent classes: 
========================================================= 
number of observations: 1755 
number of estimated parameters: 27 
residual degrees of freedom: 70 
maximum log-likelihood: -7628.878 
 
AIC(2): 15311.76
BIC(2): 15459.45
G^2(2): 89.21528 (Likelihood ratio/deviance statistic) 
X^2(2): 86.54713 (Chi-square goodness of fit) 
 

Our output is showing us our two classes, as we specified, and, for each class, the probability of giving each response category on each variable. This will be explained in more depth shortly. The output also gives us our estimated population shares (how many are observed) and the predicted class memberships (how many would be predicted).

We also get some fit statistics and those are interpreted as we would normally expect.

Seeing a plot of the class probabilities will be helpful as an explanatory tool, and the help file for the election data is critical for knowing what the levels are within the data:

EDUC is 1 (8th grade or less) through 7 (advanced degree), GENDER is 1 (male) and 2 (female), and PARTY is 1 (strong Democrat) to 7 (strong Republican).

plot(lcaMod2)

If we had to put a face to class 2, we might imagine mostly men, with higher levels of education, who trend Republican. Class 1 is largely women, with slightly less education and a trend towards Democrat.

One interesting thing about LCA is that an automated class selection mechanism really does not exist – you need to engage in some model comparison.

Let’s try models with 3, 4, and 5 classes.

lcaMod3 = poLCA(lcaFormula, lcaDat, 3, maxiter = 5000)
Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$EDUC
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.0000 0.0288 0.2160 0.2349 0.1296 0.2757 0.1149
class 2:  0.0805 0.0700 0.1452 0.1752 0.0001 0.3002 0.2288
class 3:  0.0513 0.0860 0.3922 0.1987 0.0868 0.1275 0.0576

$GENDER
           Pr(1)  Pr(2)
class 1:  0.4086 0.5914
class 2:  1.0000 0.0000
class 3:  0.3186 0.6814

$PARTY
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.1005 0.1315 0.1240 0.0199 0.1699 0.1748 0.2794
class 2:  0.2087 0.0000 0.1672 0.0734 0.2157 0.1884 0.1466
class 3:  0.2776 0.2161 0.1721 0.2057 0.0733 0.0553 0.0000

Estimated class population shares 
 0.4138 0.1208 0.4654 
 
Predicted class memberships (by modal posterior prob.) 
 0.384 0.0872 0.5288 
 
========================================================= 
Fit for 3 latent classes: 
========================================================= 
number of observations: 1755 
number of estimated parameters: 41 
residual degrees of freedom: 56 
maximum log-likelihood: -7615.985 
 
AIC(3): 15313.97
BIC(3): 15538.25
G^2(3): 63.42943 (Likelihood ratio/deviance statistic) 
X^2(3): 63.31322 (Chi-square goodness of fit) 
 
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND 
 
plot(lcaMod3)

lcaMod4 = poLCA(lcaFormula, lcaDat, 4, maxiter = 5000)
Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$EDUC
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.0000 0.0000 0.0036 0.2239 0.1350 0.4074 0.2301
class 2:  0.0000 0.0652 0.5339 0.2625 0.0438 0.0946 0.0000
class 3:  0.1639 0.1330 0.1808 0.0547 0.2288 0.1251 0.1137
class 4:  0.1957 0.2135 0.4163 0.0789 0.0000 0.0422 0.0534

$GENDER
           Pr(1)  Pr(2)
class 1:  0.5766 0.4234
class 2:  0.3246 0.6754
class 3:  0.0000 1.0000
class 4:  1.0000 0.0000

$PARTY
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.1560 0.1038 0.1374 0.0563 0.1720 0.1609 0.2136
class 2:  0.1882 0.1657 0.1757 0.1603 0.1266 0.0716 0.1119
class 3:  0.3032 0.3709 0.0901 0.0987 0.0296 0.1076 0.0000
class 4:  0.2789 0.0450 0.1673 0.1325 0.0959 0.2207 0.0596

Estimated class population shares 
 0.3711 0.4396 0.1078 0.0815 
 
Predicted class memberships (by modal posterior prob.) 
 0.4376 0.433 0.0764 0.053 
 
========================================================= 
Fit for 4 latent classes: 
========================================================= 
number of observations: 1755 
number of estimated parameters: 55 
residual degrees of freedom: 42 
maximum log-likelihood: -7601.741 
 
AIC(4): 15313.48
BIC(4): 15614.34
G^2(4): 34.94146 (Likelihood ratio/deviance statistic) 
X^2(4): 34.35936 (Chi-square goodness of fit) 
 
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND 
 
plot(lcaMod4)

lcaMod5 = poLCA(lcaFormula, lcaDat, 5, maxiter = 5000)
Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$EDUC
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.1585 0.1462 0.3503 0.0324 0.1510 0.0130 0.1485
class 2:  0.1920 0.2198 0.4563 0.1135 0.0008 0.0176 0.0000
class 3:  0.0000 0.0469 0.4691 0.2706 0.0698 0.1436 0.0000
class 4:  0.0000 0.0000 0.0001 0.1768 0.0898 0.3918 0.3416
class 5:  0.0000 0.0435 0.2192 0.2813 0.1456 0.2929 0.0175

$GENDER
           Pr(1)  Pr(2)
class 1:  0.0000 1.0000
class 2:  1.0000 0.0000
class 3:  0.3233 0.6767
class 4:  0.6588 0.3412
class 5:  0.3909 0.6091

$PARTY
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.3286 0.3430 0.1008 0.0950 0.0349 0.0977 0.0000
class 2:  0.2807 0.0536 0.1729 0.1378 0.0912 0.2122 0.0517
class 3:  0.1960 0.0858 0.2394 0.2080 0.1740 0.0000 0.0967
class 4:  0.1771 0.0495 0.1673 0.0594 0.2128 0.1277 0.2062
class 5:  0.1147 0.3298 0.0060 0.0142 0.0309 0.2903 0.2140

Estimated class population shares 
 0.1114 0.0831 0.3536 0.2393 0.2126 
 
Predicted class memberships (by modal posterior prob.) 
 0.0957 0.0764 0.3812 0.2348 0.212 
 
========================================================= 
Fit for 5 latent classes: 
========================================================= 
number of observations: 1755 
number of estimated parameters: 69 
residual degrees of freedom: 28 
maximum log-likelihood: -7595.543 
 
AIC(5): 15329.09
BIC(5): 15706.53
G^2(5): 22.54605 (Likelihood ratio/deviance statistic) 
X^2(5): 22.1376 (Chi-square goodness of fit) 
 
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND 
 
plot(lcaMod5)

Using each model’s AIC, let’s see which model might work the best:

rbind(class2 = lcaMod2$aic, 
      class3 = lcaMod3$aic, 
      class4 = lcaMod4$aic, 
      class5 = lcaMod5$aic)
           [,1]
class2 15311.76
class3 15313.97
class4 15313.48
class5 15329.09

The 2-class model has the lowest AIC, and the 3- and 4-class models are barely different from it, so we could stick with parsimony and be happy with 2 classes.
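Since BIC penalizes extra parameters more heavily than AIC, it is worth confirming that it agrees – poLCA stores it in the bic element:

```
rbind(class2 = lcaMod2$bic, 
      class3 = lcaMod3$bic, 
      class4 = lcaMod4$bic, 
      class5 = lcaMod5$bic)
```

The BIC values reported in the output above (15459 for 2 classes, climbing to 15707 for 5) point even more strongly towards the 2-class model.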

Latent class analysis will also let us include a covariate in our model, essentially performing a regression. Let’s define a latent class based upon views of the presidential candidates (the B or G at the end of each variable name denotes ratings of Bush or Gore):

covariateFormula = cbind(LEADG, MORALG, LEADB, MORALB) ~ PARTY

intelPartyLCA = poLCA(covariateFormula, lcaDat, 2, maxiter = 5000)
Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$LEADG
          1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1:            0.2524       0.5765         0.1476            0.0234
class 2:            0.0118       0.2264         0.5240            0.2379

$MORALG
          1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1:            0.3821       0.5261         0.0736            0.0182
class 2:            0.0935       0.4234         0.3116            0.1716

$LEADB
          1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1:            0.0621       0.4034         0.3756            0.1590
class 2:            0.3013       0.6316         0.0575            0.0096

$MORALB
          1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1:            0.1016       0.4847         0.3178            0.0960
class 2:            0.3621       0.5576         0.0687            0.0117

Estimated class population shares 
 0.569 0.431 
 
Predicted class memberships (by modal posterior prob.) 
 0.5696 0.4304 
 
========================================================= 
Fit for 2 latent classes: 
========================================================= 
2 / 1 
            Coefficient  Std. error  t value  Pr(>|t|)
(Intercept)    -5.65737     0.45270  -12.497         0
PARTY           1.32879     0.10246   12.969         0
========================================================= 
number of observations: 1459 
number of estimated parameters: 26 
residual degrees of freedom: 229 
maximum log-likelihood: -6438.342 
 
AIC(2): 12928.68
BIC(2): 13066.11
X^2(2): 3273.342 (Chi-square goodness of fit) 
 

We have the same output with regard to our class membership that we had before, but we also get some information about our regression model fit. In our model, we specified two latent classes; examining the response probabilities would lead us to assume that those in class 1 view candidate G favorably, while those in class 2 view candidate B favorably. In our regression output, we are comparing class 2 to class 1. We see our typical regression results and we know that it is a significant one, but what exactly does it mean?

To start, we are comparing the likelihood that a person will belong to class 2 as opposed to class 1 (that is what the 2 / 1 indicates). If we plug each possible PARTY score into the fitted equation, we can convert the resulting log-odds into prior probabilities of class membership:

classPriors = exp(cbind(1, c(1:7)) %*% intelPartyLCA$coeff)

priorProbs = cbind(1, classPriors) / (1 + rowSums(classPriors))

priorProbsData = data.frame(class1G = priorProbs[, 1], 
                            class2B = priorProbs[, 2], 
                            partyValue = 1:7) %>% 
  tidyr::gather(., key = class, value = probability, -partyValue)

library(ggplot2)

ggplot(priorProbsData, aes(x = partyValue, y = probability, color = class)) + 
  geom_line() +
  theme_minimal()

Recall that PARTY is coded 1 (strong Democrat) to 7 (strong Republican). Someone with a PARTY value of 1 would have a probability near 1 of belonging to class 1 and a probability near 0 of belonging to class 2. Conversely, someone with a PARTY value of 7 would have a near 1 probability of belonging to class 2 and a near 0 probability for class 1.

Using this added feature of latent class analysis can help to make more sense of the classes and give you a better idea about how people within those classes might behave.

Generalized Additive Models

Parametric…Or Not

Most of the statistical shenanigans you have seen to this point have come from the parametric family. In other words, we have been making assumptions about the underlying distributions. What if we don’t want to make those assumptions, or we really have no idea and want to let the shape be defined by our actual data? Then we are operating in a non-parametric space. How do we let our data do the talking?

Smoothing

There are many types of smooths. You will frequently see the loess (local regression – sometimes you will hear it called locally-weighted scatterplot smoothing, or lowess). With a loess line, we are fitting some polynomial (generally linear or quadratic) to a small section of our data at a time (i.e., a local group) – this is a little more sophisticated than a moving-average type of smooth. Each small section has an associated fitted line, and neighboring lines get joined at the section boundaries (those join points are referred to as knots). Since we are largely in control here, we get to specify how wiggly things might get.
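As a quick illustration on R’s built-in cars data, the base lowess() function exposes that control through the smoother span f (the f values below are arbitrary choices):

```
# Smaller spans chase the local pattern; larger spans smooth it away
plot(dist ~ speed, data = cars)
lines(lowess(cars$speed, cars$dist, f = 1/3), col = "red")   # wigglier
lines(lowess(cars$speed, cars$dist, f = 2/3), col = "blue")  # smoother
```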

You will also see regression splines (largely what we will be using here today). The great thing about these is that we can penalize them!

Additive Models

Very briefly, an additive model is not much different from our normal interpretation of a model. In an additive model, we can look at the effect of a predictor on a dependent variable without any consideration for what other variables might be in the model. We then add those individual effects together to best predict our response.

GAM

GMAT TIME – lm is to glm, as additive models are to…?

During the last few weeks, we have largely been working in the generalized linear models framework. We are going to stay in the general vicinity, but start moving to some more interesting places! We have mostly seen straight lines being fitted to various things. As many of you have likely noted, it often seems like a straight line doesn’t really fit the relationships that we can see within our data. So, what do you do? We could always go the transformation route, but that seems a bit antiquated at this point…don’t you think?

What if we fit a smooth line to our data instead of trying to jam a single straight line somewhere it does not want to be or do something like throwing a single quadratic term into the model? Now we are doing something interesting.

The car package is old-school R, but still has some handy stuff for us.

library(car)

library(dplyr)

noLeaders = read.csv("leadershipRatingsAgreement.csv")

scatterplotMatrix(noLeaders[, !(names(noLeaders) %in% c("leaderID"))])

The splom (scatterplot matrix) that we just saw gives us a really good idea about the relationships within the data. The green line is the linear fit and the red line is a smoothed line. If those are not sitting on top of each other, then you might want to think carefully about the relationship that is present.

plotDat = noLeaders %>% 
  dplyr::select(effect, enabling) %>% 
  na.omit

plot(effect ~ enabling, data = plotDat)
lines(sort(plotDat$enabling), 
      fitted(lm(effect ~ enabling, 
                data = plotDat))[order(plotDat$enabling)], col = "red")
lines(sort(plotDat$enabling), 
      fitted(lm(effect ~ I(enabling^2), 
                data = plotDat))[order(plotDat$enabling)], col = "blue")
lines(sort(plotDat$enabling), 
      fitted(lm(effect ~ I(enabling^3), 
                data = plotDat))[order(plotDat$enabling)], col = "green")

The preceding figure shows us 3 different lines: a linear regression line (red), a fit on the squared term (blue), and a fit on the cubed term (green). We will use them as a reference.

Let’s check this out:

lmTest = lm(effect ~ enabling, data = noLeaders)

summary(lmTest)

Call:
lm(formula = effect ~ enabling, data = noLeaders)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6508 -0.3554 -0.1035  0.2485  3.6317 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.71439    0.01218   58.66   <2e-16 ***
enabling     0.91594    0.03018   30.35   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5379 on 4619 degrees of freedom
  (265 observations deleted due to missingness)
Multiple R-squared:  0.1663,    Adjusted R-squared:  0.1661 
F-statistic: 921.2 on 1 and 4619 DF,  p-value: < 2.2e-16

Nothing too new here, so let’s move along!

library(mgcv)

gamTest = gam(effect ~ enabling, data = noLeaders)

summary(gamTest)

Family: gaussian 
Link function: identity 

Formula:
effect ~ enabling

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.71439    0.01218   58.66   <2e-16 ***
enabling     0.91594    0.03018   30.35   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


R-sq.(adj) =  0.166   Deviance explained = 16.6%
GCV = 0.28951  Scale est. = 0.28939   n = 4621

You should notice that there is no difference between our standard linear model and our gam with regard to the coefficient. If we do not smooth a variable, it gets treated just like it would in a linear regression model. We also get some output such as adjusted R^2 (interpreted as per normal) and deviance explained, which gives us very similar information to adjusted R^2 (it just uses a different error calculation than the sums of squared error between fitted and observed values). The scale estimate, in this case, is the residual standard error squared. GCV is the minimized generalized cross-validation score and it gives us an idea about our prediction error (ideally, we want this to be a small value).

Let’s try to smooth. In the following code, you will notice how we wrapped our term in s(). Believe it or not, this is to specify a smooth term. We could spend a whole week on different ways to smooth things, but we will just stick with s() and its defaults for now.

gamTestSmooth = gam(effect ~ s(enabling), data = noLeaders)

summary(gamTestSmooth)

Family: gaussian 
Link function: identity 

Formula:
effect ~ s(enabling)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.995363   0.007893   126.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df     F p-value    
s(enabling) 4.926  6.073 156.4  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =   0.17   Deviance explained = 17.1%
GCV = 0.28827  Scale est. = 0.2879    n = 4621

After smoothing our term, we can see that our output has changed. Instead of getting a linear regression coefficient, we get an edf (estimated degrees of freedom). While these edf values lack the clean interpretation of our linear regression coefficients, we can still get a great deal of information from them. The closer the edf is to 1, the more linear in nature the term actually is. As the edf climbs beyond 1, the relationship becomes increasingly wiggly.

Since we included a smooth term, we can see that our model fit has improved from our previous gam without a smooth term.
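We can also plot the estimated smooth directly – mgcv’s plot method even reports the edf in the y-axis label:

```
# Visualize the fitted smooth with a shaded confidence band
plot(gamTestSmooth, shade = TRUE)
```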

If we plot our newly-fitted gam model back onto our previous visualization, here is what we get:

plot(effect ~ enabling, data = plotDat)
lines(sort(plotDat$enabling), 
      fitted(lm(effect ~ enabling, 
                data = plotDat))[order(plotDat$enabling)], col = "red")
lines(sort(plotDat$enabling), 
      fitted(lm(effect ~ I(enabling^2), 
                data = plotDat))[order(plotDat$enabling)], col = "blue")
lines(sort(plotDat$enabling), 
      fitted(lm(effect ~ I(enabling^3), 
                data = plotDat))[order(plotDat$enabling)], col = "green")
lines(sort(plotDat$enabling), 
      fitted(gam(effect ~ s(enabling), 
                 data = noLeaders))[order(plotDat$enabling)], 
      col = "orange") 

The soft orange line is our gam fit. We can see that it does not rocket upwards, like our higher-order terms, but is instead capturing a bit of the downward trend towards the larger values of the enabling variable.

Bias/Variance Trade-Off

The wiggle can be controlled, and you are the one to control it (all models are your monster, so build them in a way that you can control). An important consideration with the wiggle (and with almost all of our decisions from here on out) is the bias/variance trade-off. You will see this called other things (e.g., error/variance), depending on with whom you are hanging around. Since we have only talked about bias briefly, we do not need to worry about getting bias in this sense conflated with anything else.

It works like this: you cannot have your cake and eat it too. Do you want your in-sample prediction to be awesome (low bias)? Great! You can count on getting that at the expense of higher variance. The lower the variance, the better your model will predict new data. Well, that sounds easy – just go with the lowest variance. But…that might mean missing some weird pattern. Again, it is just a decision to make (you likely won’t be facing off with your monsters in the Arctic in the end).

With our gam models, the wigglier your line, the lower your bias will be and the better you are doing at predicting in sample.

library(ggplot2)

gamTestLambda1 = gam(effect ~ s(enabling, sp = 0, k = 40), data = noLeaders)

p = predict(gamTestLambda1, type = "lpmatrix")

beta = coef(gamTestLambda1)

s = p %*% beta

plotDat = cbind.data.frame(s = s, enabling = na.omit(noLeaders$enabling))

gam1Plot = ggplot(plotDat, aes(enabling, s)) + 
  geom_line(color = "#ff5500") +
  geom_point(data = noLeaders, aes(enabling, effect), alpha = .5) +
  theme_minimal()

gamTestLambda9 = gam(effect ~ s(enabling, sp = .9, k = 40), data = noLeaders)

p = predict(gamTestLambda9, type = "lpmatrix")

beta = coef(gamTestLambda9)

s = p %*% beta

plotDat = cbind.data.frame(s = s, enabling = na.omit(noLeaders$enabling))

gam9Plot = ggplot(plotDat, aes(enabling, s)) + 
  geom_line(color = "#ff5500") +
  geom_point(data = noLeaders, aes(enabling, effect), alpha = .5) +
  theme_minimal()

library(gridExtra)

gridExtra::grid.arrange(gam1Plot, gam9Plot)

In the top plot, we have allowed our line a bit more flexibility to wiggle – you can see the line bending more to fit the pattern within the data. We are going to get very good in-sample prediction here, at the expense of out-of-sample prediction. The bottom plot is a bit more reserved. It will undoubtedly do better out-of-sample, but might be missing something within the in-sample data.

Decision Trees

Decision trees are great for classification (i.e., which category does something belong to). For the type of decision tree that we are going to use, we can think about it almost like we would a logistic model. There are, however, some major differences. The first is that we can feed any number of predictors to the tree. At our root node, we will have all of the data and all of the predictors. The tree then scans the predictor variables for the variable, and the split point on that variable, that produces the “purest” branches – the best separation between “0” and “1”. We then create another branch and repeat the process. We do this until all of the observations are classified.
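For classification, rpart’s default measure of that purity is the Gini index, and the calculation behind it is simple enough to sketch:

```
# Gini impurity of a node: 1 - sum of squared class proportions
gini = function(p) 1 - sum(p^2)
gini(c(.5, .5))  # a 50/50 node is maximally impure for 2 classes: 0.5
gini(c(1, 0))    # a pure node scores 0
```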

One beauty of the decision tree is in its ability to select important variables. One drawback, however, is that it does this in a greedy fashion. It finds what splits best and goes with it at that node; it does this without consideration of what variables might help to split better later on.

Another great thing about these trees is the ability to use classification or regression.

library(RCurl)

url = "https://raw.githubusercontent.com/gastonstat/CreditScoring/master/CleanCreditScoring.csv"

creditScores = getURL(url)

creditScores = read.csv(textConnection(creditScores))
library(caret)

library(rpart)

library(rpart.plot)

library(e1071)

set.seed(10001)

tree1 = rpart(Status ~ Income + Savings + Debt, 
              data = creditScores, method = "class")

rpart.plot::prp(tree1, box.palette = "Blues", extra = 8)

If we notice the “extra” argument in our prp function, we will see an 8. This option gives us the probability of belonging to that class (e.g., for everyone with a savings over 2, there is a .78 probability of having good credit). There are many options, so be sure to look at the help file and play around with them.

confusionMatrix(predict(tree1, type = "class"), 
                creditScores$Status)
Confusion Matrix and Statistics

          Reference
Prediction  bad good
      bad   131   92
      good 1118 3105
                                          
               Accuracy : 0.7278          
                 95% CI : (0.7145, 0.7409)
    No Information Rate : 0.7191          
    P-Value [Acc > NIR] : 0.09917         
                                          
                  Kappa : 0.1015          
 Mcnemar's Test P-Value : < 2e-16         
                                          
            Sensitivity : 0.10488         
            Specificity : 0.97122         
         Pos Pred Value : 0.58744         
         Neg Pred Value : 0.73526         
             Prevalence : 0.28093         
         Detection Rate : 0.02946         
   Detection Prevalence : 0.05016         
      Balanced Accuracy : 0.53805         
                                          
       'Positive' Class : bad             
                                          

Let’s check one out with some more variables:

set.seed(10001)

noRecodes = creditScores %>% 
  dplyr::select(-ends_with("R"))

bigTree = rpart(Status ~ ., data = noRecodes, 
                method = "class")

rpart.plot::prp(bigTree, box.palette = "Blues", extra = 8)

confusionMatrix(predict(bigTree, type = "class"), 
                noRecodes$Status)
Confusion Matrix and Statistics

          Reference
Prediction  bad good
      bad   480  182
      good  769 3015
                                          
               Accuracy : 0.7861          
                 95% CI : (0.7737, 0.7981)
    No Information Rate : 0.7191          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3821          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.3843          
            Specificity : 0.9431          
         Pos Pred Value : 0.7251          
         Neg Pred Value : 0.7968          
             Prevalence : 0.2809          
         Detection Rate : 0.1080          
   Detection Prevalence : 0.1489          
      Balanced Accuracy : 0.6637          
                                          
       'Positive' Class : bad             
                                          

And we can mess around with a few different parameters. The “cp” argument represents the complexity parameter (how much a split must improve the fit before it is attempted) and minsplit represents the minimum number of observations that must be in a node before a split is attempted.

set.seed(10001)

noRecodes = creditScores %>% 
  dplyr::select(-ends_with("R"))

overgrownTree = rpart(Status ~ ., data = noRecodes,
                      method = "class",
                      control = rpart.control(cp = 0, 
                                              minsplit = 5))

rpart.plot::prp(overgrownTree, box.palette = "Blues", extra = 8)

confusionMatrix(predict(overgrownTree, type = "class"), 
                noRecodes$Status)
Confusion Matrix and Statistics

          Reference
Prediction  bad good
      bad  1139   81
      good  110 3116
                                          
               Accuracy : 0.957           
                 95% CI : (0.9507, 0.9628)
    No Information Rate : 0.7191          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.8929          
 Mcnemar's Test P-Value : 0.04276         
                                          
            Sensitivity : 0.9119          
            Specificity : 0.9747          
         Pos Pred Value : 0.9336          
         Neg Pred Value : 0.9659          
             Prevalence : 0.2809          
         Detection Rate : 0.2562          
   Detection Prevalence : 0.2744          
      Balanced Accuracy : 0.9433          
                                          
       'Positive' Class : bad             
                                          

We can see what happens when we reduce the number of observations needed for any one split and completely relax the need to improve our fit at every node – our tree is allowed to grow almost unchecked. Although our decision tree is now absolutely crazy and nearly impossible to follow, we have greatly improved our accuracy. Again, this in-sample accuracy is likely going to come at the cost of poor out-of-sample accuracy.
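One common guard against that overfitting is to grow the big tree and then prune it back using the cross-validated error stored in the cp table – a sketch with rpart’s built-in tools:

```
# xerror is the cross-validated error at each tree size; prune back to
# the complexity value that minimizes it
printcp(overgrownTree)
bestCP = overgrownTree$cptable[which.min(overgrownTree$cptable[, "xerror"]), "CP"]
prunedTree = prune(overgrownTree, cp = bestCP)
```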

For the love of all that is holy, is this really necessary? Here is a big hint – if you have a limited set of predictors with clearly linear relationships that are bound by concrete theory, then a linear regression model will outperform these trees. If, on the other hand, you have more than a few predictors and no clear theory to guide variable selection, then the trees will be great.

Random Forest

Decision trees are clearly a lot of fun and have a wide variety of uses. While sometimes one tree is helpful, everything I know about horror movies leads me to believe that a single tree (perhaps on a desolate hill) always leads to some type of danger. If we remember the issue with trees being greedy and just splitting things how it finds it best, we can imagine what might happen if we only consider one tree.

This brings us to ensemble methods. If we think about what an ensemble is, we can start to guess what ensemble methods might be. There are many ensemble methods, but the one we are going to talk about now is random forests. Trees..forest…ensemble…are the connections starting to happen?

Forest is probably less pretentious than “tree ensemble”, but where does the “random” come into play? What would be the likely result of running the exact same tree 1000 times? Some minor differences aside, almost nothing. In our random forest, we will create randomness in 2 ways: by drawing a random sample of observations for each tree and by drawing a random sample of predictors at each split.

library(randomForest)

set.seed(10001)

rfTest = randomForest(Status ~ ., data = noRecodes, 
                      importance = TRUE, 
                      ntree = 1000)

rfTest

Call:
 randomForest(formula = Status ~ ., data = noRecodes, importance = TRUE,      ntree = 1000) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of  error rate: 20.78%
Confusion matrix:
     bad good class.error
bad  624  625  0.50040032
good 299 2898  0.09352518

Here we see some simple summary output, with our out-of-bag (OOB) error rate and the same type of confusion matrix we saw earlier. You might be wondering what “out-of-bag” means, and rightfully so! Each tree is grown on a bootstrap sample of the data; instead of wasting the unsampled observations, the forest uses them as a testing set for that tree (i.e., the data that was not in the sample “bag” gets used to perform a very basic cross validation).
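A handy byproduct is that the forest tracks the running OOB error as trees are added, which lets us check whether ntree was large enough:

```
# err.rate holds the OOB error (and per-class errors) after each tree
plot(rfTest$err.rate[, "OOB"], type = "l",
     xlab = "number of trees", ylab = "OOB error rate")
```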

We can also get the more elaborate confusion matrix from caret (note that the matrix is flipped from the previous output!):

confusionMatrix(rfTest$predicted, noRecodes$Status, positive = "good")
Confusion Matrix and Statistics

          Reference
Prediction  bad good
      bad   624  299
      good  625 2898
                                         
               Accuracy : 0.7922         
                 95% CI : (0.7799, 0.804)
    No Information Rate : 0.7191         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.4412         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.9065         
            Specificity : 0.4996         
         Pos Pred Value : 0.8226         
         Neg Pred Value : 0.6761         
             Prevalence : 0.7191         
         Detection Rate : 0.6518         
   Detection Prevalence : 0.7924         
      Balanced Accuracy : 0.7030         
                                         
       'Positive' Class : good           
                                         

Next, let’s look at our variable importance:

importance(rfTest)
                 bad      good MeanDecreaseAccuracy MeanDecreaseGini
Seniority 45.1052118 32.213205            55.891401        177.68237
Home      18.3746102 23.034527            31.121596         84.76986
Time       9.2540182 21.972602            24.889852         59.91084
Age       -0.3968928 24.139723            20.981706        128.09684
Marital    5.4525639  9.937768            12.344187         36.81541
Records   67.0224249 57.555206            81.899248        115.10560
Job       43.8688689 46.721606            62.004885        105.26650
Expenses   1.2742437 20.331422            18.725488         84.03249
Income     9.5274372 34.550613            40.144686        159.98766
Assets    32.5556150 32.230098            47.235940        112.50443
Debt       4.4347507  4.077064             6.054631         37.13761
Amount    12.8418494 31.957065            37.353952        138.63217
Price      3.9116821 25.000101            26.328744        150.57211
Finrat    25.3953746 35.631124            44.856744        195.33558
Savings   25.7414402 39.852088            52.361732        196.66277
varImpPlot(rfTest)

We can look at the “MeanDecreaseAccuracy” and the “MeanDecreaseGini” to get a sense of how important each variable might be. The help file for importance sheds light on how “MeanDecreaseAccuracy” is computed, but all you really need to know is that it tells us how much worse the model performed without the variable (variables with a higher value are more important for prediction). Gini is usually used as an inequality measure, but it is conceptualized as node impurity here – variables with higher values are those whose removal resulted in greater node impurity.
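The permutation idea behind “MeanDecreaseAccuracy” is easy to sketch by hand. The following base R example (simulated data, made-up variable names) scrambles each predictor in turn and records how much accuracy drops:

```r
# A rough base R sketch of permutation importance:
# shuffle one predictor at a time and see how much the
# model's accuracy suffers. Data and names are made up.
set.seed(1001)

n = 500
signal = rnorm(n)                           # a predictor that matters
noise = rnorm(n)                            # a predictor that does not
status = rbinom(n, 1, plogis(2 * signal))   # outcome driven by signal only

toyData = data.frame(status, signal, noise)
fit = glm(status ~ signal + noise, data = toyData, family = binomial)

accuracy = function(dat) {
  mean((predict(fit, dat, type = "response") > .5) == dat$status)
}

baseline = accuracy(toyData)

# Permute each predictor in turn and record the accuracy drop --
# important variables hurt more when scrambled.
permuted = sapply(c("signal", "noise"), function(v) {
  shuffled = toyData
  shuffled[[v]] = sample(shuffled[[v]])
  baseline - accuracy(shuffled)
})

permuted                                    # "signal" shows the larger drop
```

randomForest does the same thing tree by tree on the out-of-bag data and averages the drops, but the intuition is identical.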

The Black Box!

Now we know what variables are important for our random forest, so when do we interpret our trees from the forest!?! We can’t…so we don’t!!! Interpreting a single tree is statistical nonsense when we are dealing with a random forest. We are really only concerned with prediction at this point.

Outside of Breiman, that does not satisfy anybody.

# devtools::install_github('araastat/reprtree')

library(reprtree)

repTree = ReprTree(rfTest, noRecodes)
[1] "Constructing distance matrix..."
[1] "Finding representative trees..."
plot(repTree)

What happened? While we do not want to overgrow a single tree, a random forest does not care in the slightest. In fact, it will grow the deepest tree that it possibly can. If you would say that this causes over-fitting, then you would be absolutely correct. Remember, though, that we are generating a lot of trees and using all of these trees to vote on the outcomes. Therefore, the over-fitting within one tree really does not make a big difference – it will come out in the wash. But you can see why interpreting a tree from a forest is practically nonsense!

While interpreting a tree is indeed nonsense, we can do some work exploring what happens within the trees for specific observations. This is certainly not requisite, thus the output of the code is not provided, but you could use the following code from the lime package to do some exploring on your own.

library(caret)

library(lime)

sampleObs = sample.int(n = nrow(noRecodes),
                       size = floor(.75 * nrow(noRecodes)), replace = F)

train = noRecodes[sampleObs, ]

test = noRecodes[-sampleObs, ]

modelTrain = train(Status ~ ., data = train, method = 'rf')

explainer = lime(train, modelTrain)

explanation = explain(test, explainer, n_labels = 2, n_features = 3)

head(explanation)

plot_features(explanation)

This is but one type of classification tree/random forest scenario. The party package has some different tree/forest algorithms. Let’s take a look at a random forest fit with conditional inference. Instead of using something like the Gini index to maximize information, conditional inference trees use significance testing to perform the splits.
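To make the idea concrete, here is a toy base R sketch (simulated data, made-up variable names) of choosing a split variable through significance testing rather than an impurity measure:

```r
# A toy base R illustration of the conditional inference idea:
# test each predictor's association with the outcome and split
# on the most significant one. Data and names are made up.
set.seed(1001)

n = 300
status = factor(rep(c("bad", "good"), length.out = n))
related = ifelse(status == "good", rnorm(n, 1), rnorm(n, 0))
unrelated = rnorm(n)

# Smaller p-value = stronger evidence of association with the outcome.
pValues = c(
  related = t.test(related ~ status)$p.value,
  unrelated = t.test(unrelated ~ status)$p.value
)

names(which.min(pValues))                   # the split variable chosen
```

cforest repeats this kind of testing at every node (with permutation-based tests rather than plain t-tests), which also gives it some protection against favoring variables with many possible split points.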

library(party)

crfTest = cforest(Status ~ ., data = noRecodes, 
                  controls = cforest_control(ntree = 1000))

Let’s see if we did any better:

confusionMatrix(predict(crfTest), noRecodes$Status, positive = "good")
Confusion Matrix and Statistics

          Reference
Prediction  bad good
      bad   824  132
      good  425 3065
                                          
               Accuracy : 0.8747          
                 95% CI : (0.8646, 0.8843)
    No Information Rate : 0.7191          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.666           
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9587          
            Specificity : 0.6597          
         Pos Pred Value : 0.8782          
         Neg Pred Value : 0.8619          
             Prevalence : 0.7191          
         Detection Rate : 0.6894          
   Detection Prevalence : 0.7850          
      Balanced Accuracy : 0.8092          
                                          
       'Positive' Class : good            
                                          

We certainly did! This all boils down to tweaking the parameters of your forest. It is not terribly hard to make incremental improvements to your prediction rates, but sometimes you can make pretty significant jumps just by switching up the way the nodes are split.

We could get a sample tree from this forest, but we are not going to do it. One important thing to note is that party will not create a deep tree by default.

Recommender Systems

Overview

While provocative statements are often fun to make, you will not get one here. Instead, I would prefer to play it safe and make an assumption that you have used some form of online commerce during the past year. Maybe you bought something off Amazon or you used Netflix to make poor choices around bedtime. If you have used any such services, you have interacted with a recommender system. Recommender systems are a class of models designed to make predictions about items that people might like/endorse/want, despite never having seen those items. It is exactly how Netflix makes recommendations (see how that works?) and how Amazon gives you items inspired by your viewing history.

Every time you click, watch, rate, or buy, you are helping to train these recommender systems. Imagine yourself as “Person A”. As “Person A”, you have rated the show Ash Vs. Evil Dead highly (and watched every available episode). Another person, likely “Person B”, has done the same. “Person B” also provided a very high rating for the movie Bubba Ho-tep. Since you and “Person B” both enjoyed Ash Vs. Evil Dead, a video service would likely recommend Bubba Ho-tep to you.

While our little contrived scenario serves as a clean example, recommender systems are great when there are many items (as is the case with nearly every online entity).

Package Installation

While we are going to use recommenderlab (for ease and speed), recosystem is an excellent package and one that you should keep in mind.

install.packages("recommenderlab")

install.packages("recosystem")

devtools::install_github("tarashnot/SlopeOne")

devtools::install_github("tarashnot/SVDApproximation")

Recommender Systems Types

Like most statistical techniques, recommender systems are a veritable 31 flavors. However, your data and needs will largely dictate the precise method that is used.

Content-based

Sometimes, items have a discrete set of descriptors and users can endorse the descriptors that they like. For example, you might like music that evokes memories of cruising in your Testarossa under pink and blue neon lights (possibly in Miami). There is a whole genre of music that fits this bill and is tagged appropriately on platforms like Pandora. When you offer a thumbs up or down, you are helping to train the system about what you like and which features of the music are important to you. If Pandora plays a song without synthesizers and I provide a thumbs down, the system knows that I might not like songs without synth. Next, I will likely give a thumbs up to something with synth. Finally, I will give a thumbs down to a track with synth and vocals. Pandora now knows that I put a large weight on the synth feature and it will continue offering me content accordingly. Depending upon the system, I might run into an issue with diversity if I am too specific or all of my weight is on one feature.

The content-based approach works well when there is just a single content type to be offered (music, movies, news, etc.). However, it does not perform well when the content is mixed (how well do you think Pandora could recommend news to you given your musical preferences?). To that end, other techniques might be considered.

Collaborative Filtering

Fortunately, collaborative filtering tells us what is going on by name alone! Through a collaborative effort, items are filtered to you – people who act like you help to filter better predicted items to you. Our previous example about Ash Vs. Evil Dead is an excellent example of collaborative filtering.

Another interesting thing to note, though, is that we have two sources of information in a collaborative filtering model: the user and the item. The goal in both is the same, but the mechanism is slightly different.

User-based

In user-based collaborative filtering models, the goal is to find another user that shares similar patterns with the target user. At its root, it is very much akin to distance-based nearest neighbor matching. In small systems, user-based models can work well. However, large and/or dynamic systems are not kind to user-based models. Computing a similarity between every pair of users in large systems is expensive, and this is compounded when users’ preferences change or update. Another issue – one that Amazon identified quickly – is that systems with many items but few ratings per item can perform poorly. To that end, some brilliant folks from Amazon decided to try something new.
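The nearest-neighbor flavor of this is easy to show in base R. Below is a tiny sketch built around our earlier Ash Vs. Evil Dead scenario (all ratings and the third title are made up; NA means the item has not been rated):

```r
# A tiny base R sketch of user-based collaborative filtering:
# find the user most similar to our target and borrow their rating.
# Ratings and "someDrama" are made up; NA means "has not rated it".
ratings = rbind(
  personA = c(ashVsEvilDead = 5, bubbaHoTep = NA, someDrama = 2),
  personB = c(ashVsEvilDead = 5, bubbaHoTep = 5,  someDrama = 1),
  personC = c(ashVsEvilDead = 1, bubbaHoTep = 2,  someDrama = 5)
)

# Correlate personA with everyone else over commonly-rated items.
similarity = apply(ratings[-1, ], 1, function(other) {
  cor(ratings["personA", ], other, use = "pairwise.complete.obs")
})

nearest = names(which.max(similarity))      # personB acts most like personA
ratings[nearest, "bubbaHoTep"]              # so we predict a rating of 5
```

Real systems use the same logic over millions of users, which is exactly why the pairwise similarity computation becomes so expensive.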

Item-based

Instead of focusing on similarities between users, item-based models (also called item-item models) focus on the ratings of the items themselves. At a certain point, the mean rating of an item won’t bounce around too terribly much, and then similarities can be computed between items. If I purchase an item, there is a high probability that I would also enjoy an item that is rated similarly.

Cold Start Problems

What happens when you are brand new to anything that uses collaborative filtering? Can any reasonable predictions be made? Not really. This is an issue with data sparsity. Imagine a matrix with every person that has bought something from Amazon on the rows and every item from Amazon on the columns – this would be an incredibly large and sparse matrix. That is why some services ask you some questions from the get-go and Netflix throws some popular things out that might be of interest to you. Given the massive amounts of metadata that you produce (time of day, navigation patterns, devices used), many systems are never without some kind of context, but concrete behavior is very helpful.

Another way to deal with sparsity is to perform what is frequently known as model-based collaborative filtering. It sounds fancy, but it really isn’t anything that you haven’t covered already! Imagine that we take our matrix and perform some type of dimension reduction on it.
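One common flavor of that dimension reduction is the singular value decomposition. Here is a minimal base R sketch on a made-up rating matrix, keeping only the first two latent dimensions:

```r
# A minimal base R sketch of model-based collaborative filtering:
# factor a (toy, made-up) rating matrix with SVD and keep only the
# first k dimensions, compressing many items into a few latent tastes.
ratings = matrix(c(5, 4, 1, 1,
                   4, 5, 2, 1,
                   1, 2, 5, 4,
                   2, 1, 4, 5), nrow = 4, byrow = TRUE)

decomposition = svd(ratings)
k = 2                                       # keep two latent dimensions

approximation = decomposition$u[, 1:k] %*%
  diag(decomposition$d[1:k]) %*%
  t(decomposition$v[, 1:k])

round(approximation, 1)                     # close to the original ratings
```

Instead of comparing users across every item, we compare them across a handful of latent dimensions, which sidesteps much of the sparsity problem.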

Hybrids

While your particular needs will mostly drive the decision, we do not always have to pick just one method and stick with it forever. Instead, we can create hybrid systems. These systems allow us to combine pieces of both collaborative filtering and content-based models to achieve better predictions. If you were to come up with a video service where you allowed users to rate videos and showed them videos with similar attributes to those that are rated highly (pulling from content-based filtering), while also taking into account how a user’s viewing/searching habits compare to other people’s (from collaborative filtering), you might just have a million dollar idea on your hands…I wonder why nobody has done such a thing…

Example

It is very important to note that the example here is going to be very generic. You might wonder what the fun is in that and I would certainly wonder the same thing. As noted before, recommender systems are computationally expensive. You will notice that we are running our model on a very small set (20%) of the data – you want to actually be able to finish this model and see results. If your machine is a bit older, it might chug on the models.

We are going to demo both main types of the collaborative filtering model.

User-based Collaborative Filtering

We can get the top recommendations for a few people given a trained model and predict ratings:

library(recommenderlab)

data("Jester5k")

ubcfTest = Recommender(Jester5k[1:1000], method = "UBCF")

ubcfRecom = predict(ubcfTest, Jester5k[1001:1005])

as(ubcfRecom, "list")
$u20089
 [1] "j10"  "j55"  "j1"   "j72"  "j93"  "j100" "j81"  "j9"   "j3"   "j70" 

$u11691
 [1] "j89"  "j98"  "j93"  "j91"  "j92"  "j73"  "j76"  "j100" "j88"  "j74" 

$u2646
 [1] "j14" "j72" "j12" "j6"  "j89" "j83" "j78" "j96" "j76" "j45"

$u8426
 [1] "j89"  "j100" "j98"  "j94"  "j97"  "j90"  "j72"  "j88"  "j93"  "j84" 

$u18250
 [1] "j89"  "j76"  "j93"  "j91"  "j96"  "j88"  "j100" "j85"  "j98"  "j83" 
ubcfPredictRating = predict(ubcfTest, Jester5k[1001:1005], 
                            type = "ratings")

as(ubcfPredictRating, "list")
$u20089
         j1          j2          j3          j4          j9         j10 
-0.18606472 -1.69237808 -0.61481930 -2.39912551 -0.51535708  0.68832207 
        j22         j23         j24         j25         j30         j33 
-2.04607082 -1.33763398 -1.38571982 -1.46730127 -1.21279006 -2.33520117 
        j37         j43         j44         j45         j47         j51 
-2.43506322 -1.96736262 -1.07455739 -1.11121356 -1.07954026 -2.29830266 
        j52         j55         j57         j58         j59         j60 
-2.29146789 -0.03947522 -3.12376136 -3.96257205 -2.44887694 -1.85488548 
        j63         j64         j67         j70         j71         j72 
-2.07872950 -1.77801425 -2.11980871 -0.70987087 -2.58715566 -0.21825293 
        j73         j74         j75         j76         j77         j79 
-1.52219624 -2.46801224 -2.39158643 -1.20860830 -2.49307746 -2.93104466 
        j80         j81         j82         j83         j84         j85 
-1.38176987 -0.47794431 -3.38405423 -2.15290598 -0.89489739 -0.75140801 
        j86         j87         j89         j90         j91         j92 
-2.87919467 -0.91506529 -0.89539157 -3.25045341 -1.74826816 -1.98510824 
        j93         j94         j95         j96         j97         j98 
-0.27227971 -1.67751644 -1.08853718 -2.12969902 -1.17342305 -1.09561458 
        j99        j100 
-2.05330591 -0.35511837 

$u11691
         j3          j4          j9         j24         j30         j33 
-0.10997420 -0.62247064 -0.85521276 -2.09410293 -1.18183580 -2.19240862 
        j37         j41         j57         j58         j59         j60 
-2.64312242 -1.60523436 -3.35176329 -3.00026607 -2.05958717 -2.89155860 
        j64         j71         j73         j74         j75         j76 
-1.03516334 -0.50982149  0.63875251  0.37566534  0.31813915  0.61790196 
        j77         j78         j80         j81         j82         j83 
 0.04386737  0.03776578  0.18632927 -0.37789176 -0.32844826  0.09008047 
        j84         j85         j86         j87         j88         j89 
-0.27971397 -1.09259219 -0.08705976  0.07758730  0.41660109  1.05256999 
        j90         j91         j92         j93         j94         j95 
-0.99064777  0.74433759  0.64981841  0.94011006 -1.03811877 -0.25782565 
        j96         j97         j98         j99        j100 
 0.28986370 -0.62320836  0.97541566 -1.24293609  0.46506480 

$u2646
        j1         j2         j3         j4         j6         j9 
-1.8158835 -1.8369514 -3.2861250 -3.0950636 -0.9599751 -3.0700649 
       j10        j11        j12        j14        j22        j24 
-2.0200899 -2.3860338 -0.7247804 -0.3873996 -2.8043553 -2.4522703 
       j25        j30        j33        j37        j38        j40 
-1.9776526 -2.0821196 -2.8677597 -3.1480525 -1.7307025 -2.0185173 
       j41        j43        j44        j45        j51        j55 
-2.8900902 -2.9653842 -3.8222027 -1.5826495 -2.5325435 -2.0997833 
       j57        j58        j60        j63        j64        j67 
-2.9362257 -3.6407675 -3.7075817 -3.2844756 -3.7008180 -2.3348311 
       j70        j71        j72        j73        j74        j75 
-2.1795389 -3.2130440 -0.6115631 -1.9066992 -2.5006417 -2.6763121 
       j76        j77        j78        j79        j80        j81 
-1.5779565 -1.8523654 -1.4364049 -1.9853804 -1.9916548 -2.3136842 
       j82        j83        j84        j85        j86        j87 
-3.1278709 -1.3424441 -2.7792060 -2.3485366 -3.3417027 -1.6187393 
       j88        j89        j90        j91        j92        j93 
-2.2341518 -1.0468588 -3.7990946 -2.5124673 -1.7996271 -1.6032134 
       j94        j96        j97        j98        j99       j100 
-2.8486634 -1.5002727 -2.3749898 -2.4496628 -2.5890311 -2.9351828 

$u8426
     j71      j72      j73      j74      j75      j76      j77      j78 
1.763926 2.844330 2.064189 1.945741 2.026668 2.618147 2.087283 2.305231 
     j79      j80      j81      j82      j83      j84      j85      j86 
2.683440 2.717686 1.972459 2.226532 2.770031 2.815541 2.573262 2.108831 
     j87      j88      j89      j90      j91      j92      j93      j94 
2.546303 2.843528 3.196710 2.857387 2.792360 2.773411 2.819905 3.078388 
     j95      j96      j97      j98      j99     j100 
2.446361 2.526136 3.017407 3.093657 2.495765 3.123495 

$u18250
     j71      j72      j73      j74      j75      j76      j77      j78 
5.947317 6.299860 5.994378 6.076744 5.847068 6.550990 5.864459 6.174978 
     j79      j80      j81      j82      j83      j84      j85      j86 
6.100444 6.021890 6.308158 6.119615 6.331735 5.893564 6.369233 6.118635 
     j88      j89      j90      j91      j93      j94      j95      j96 
6.375128 6.605151 6.071887 6.397380 6.528446 5.913060 6.258110 6.382495 
     j97      j98      j99     j100 
5.853568 6.337283 6.093690 6.371520 

Item-based Collaborative Filtering

ibcfTest = Recommender(Jester5k[1:1000], method = "IBCF")

ibcfRecom = predict(ibcfTest, Jester5k[1001:1005])

as(ibcfRecom, "list")
$u20089
 [1] "j97" "j90" "j70" "j30" "j86" "j59" "j63" "j94" "j82" "j71"

$u11691
 [1] "j33" "j37" "j59" "j97" "j64" "j57" "j41" "j86" "j95" "j94"

$u2646
 [1] "j88" "j76" "j96" "j92" "j94" "j93" "j78" "j55" "j30" "j84"

$u8426
 [1] "j79" "j84" "j86" "j71" "j74" "j75" "j77" "j95" "j94" "j98"

$u18250
 [1] "j84" "j82" "j94" "j71" "j95" "j85" "j86" "j99" "j90" "j77"
ibcfPredictRating = predict(ibcfTest, Jester5k[1001:1005], 
                            type = "ratings")

as(ibcfPredictRating, "list")
$u20089
        j1         j2         j3         j4         j9        j10 
-3.6205321 -7.5229129 -1.4391137 -1.3889552 -0.6189486 -5.6755132 
       j22        j23        j24        j25        j30        j33 
-2.5407151 -2.9002208 -0.7644542 -2.9160664  1.6783319 -1.9385028 
       j37        j43        j44        j45        j47        j51 
-0.6601058 -0.6508645 -0.5383921 -0.8501978 -2.7539340 -2.6863429 
       j52        j55        j57        j58        j59        j60 
-1.7733936 -2.3186280 -0.6461654 -1.0724435  0.4323534 -0.9114580 
       j63        j64        j67        j70        j71        j72 
 0.4147624 -0.1481822 -0.5985703  1.7908505  0.2509258 -0.8960271 
       j73        j74        j75        j76        j77        j79 
-1.2345176 -0.9094863 -0.3830234 -2.5212052 -0.8324549 -0.7450601 
       j80        j81        j82        j83        j84        j85 
-1.4661141 -1.1085947  0.2600464 -2.7033535 -1.5685415 -3.2818597 
       j86        j87        j89        j90        j91        j92 
 1.1709215 -4.8808445 -0.4856941  2.1166004 -0.6167418 -4.4546985 
       j93        j94        j95        j96        j97        j98 
-1.5267968  0.4114216 -1.7620931 -2.4643837  2.9106994 -2.6079573 
       j99       j100 
-1.0630701 -1.0080002 

$u11691
         j3          j4          j9         j24         j30         j33 
 0.32640060  1.33685601  0.27681377  1.11634329  0.79762786  2.54928232 
        j37         j41         j57         j58         j59         j60 
 2.03821443  1.61470402  1.71655653  1.16267661  1.86224492  1.40970408 
        j64         j71         j73         j74         j75         j76 
 1.83293676  0.68732601 -0.81125146  1.15857126 -0.09910178 -1.15389516 
        j77         j78         j80         j81         j82         j83 
 0.10436067 -1.90398599 -1.11375220  0.70627679  0.95424946 -0.86125941 
        j84         j85         j86         j87         j88         j89 
 0.74540963 -0.07566627  1.59732233 -1.23424746  0.44761557  0.65275203 
        j90         j91         j92         j93         j94         j95 
 0.71269923 -0.44761401 -1.10114635 -0.55724121  1.41071449  1.45644584 
        j96         j97         j98         j99        j100 
 1.25910972  1.83589555  0.33375841  1.08776853 -0.94569395 

$u2646
        j1         j2         j3         j4         j6         j9 
-3.7643646 -2.9512010 -1.5129619 -1.8988503 -4.6633212 -2.6120816 
       j10        j11        j12        j14        j22        j24 
-1.8505295 -3.4667847 -2.2063839 -2.4884973 -1.4910869 -1.6602673 
       j25        j30        j33        j37        j38        j40 
-2.0636546 -0.4758708 -2.0368016 -3.3075357 -1.3970700 -3.9081927 
       j41        j43        j44        j45        j51        j55 
-1.8286770 -2.1299526 -2.5018919 -1.2410665 -2.9227329 -0.3607317 
       j57        j58        j60        j63        j64        j67 
-2.4147751 -2.1266107 -1.3062268 -2.7773672 -2.0002061 -2.5906979 
       j70        j71        j72        j73        j74        j75 
-2.6119790 -3.0746560 -2.1468924 -3.0213681 -2.2740970 -1.9417306 
       j76        j77        j78        j79        j80        j81 
 0.9866539 -1.9556999 -0.3257765 -0.7807587 -5.2186093 -1.5808528 
       j82        j83        j84        j85        j86        j87 
-1.7458272 -3.7245876 -0.6615445 -2.8486741 -1.3416623 -1.9940062 
       j88        j89        j90        j91        j92        j93 
 1.4857077 -2.0627184 -1.7362316 -2.9726296  0.1470699 -0.1996794 
       j94        j96        j97        j98        j99       j100 
 0.1097609  0.4660366 -2.3281891 -1.2608227 -1.8464379 -3.0624484 

$u8426
        j71         j72         j73         j74         j75         j76 
 4.48489677  0.97993006  0.17113622  3.97988881  3.97526304  1.66718290 
        j77         j78         j79         j80         j81         j82 
 3.92868596  0.79615452  4.54818081  0.47094448  1.41146733  3.15797523 
        j83         j84         j85         j86         j87         j88 
-0.31840282  4.54551424  2.73270388  4.53077763  0.04227635 -0.41842378 
        j89         j90         j91         j92         j93         j94 
 2.80179002  2.87175836  0.86651171  0.05013727  2.01313958  3.44676954 
        j95         j96         j97         j98         j99        j100 
 3.75661190  1.61788235  3.17673038  3.36024303  3.27607787  2.52676938 

$u18250
     j71      j72      j73      j74      j75      j76      j77      j78 
7.472495 5.684735 4.356324 6.802610 6.861523 5.663692 6.954942 5.545492 
     j79      j80      j81      j82      j83      j84      j85      j86 
6.280012 4.867291 6.126106 8.071872 4.545116 8.113706 7.280704 7.198826 
     j88      j89      j90      j91      j93      j94      j95      j96 
5.335152 5.902523 6.992913 5.652798 6.224675 7.582607 7.328307 6.075204 
     j97      j98      j99     j100 
6.718550 6.437228 7.010913 6.101415 

Compare the results of our two models. How were they different? Did the same jokes get predicted? You also have the power to find the recommended jokes (you can index into JesterJokes with the given number).

Here is u18250’s top recommended joke:

JesterJokes[84]
                                                                                                                                                j84 
"Q: What is the difference between Mechanical Engineers and Civil Engineers? A: Mechanical Engineers build weapons, Civil Engineers build targets." 

What a knee-slapper!

Evaluation and Comparison

For a recommender system, inference practically flies out the window. We are going to focus our attention on the prediction quality. As with other matters of prediction, the classic root mean squared error will serve as our measure of quality.

First we will create our training/testing split.

e = evaluationScheme(Jester5k[1:1000], method = "split", 
                     train = 0.9, given = 15,  goodRating = 5)
e
Evaluation scheme with 15 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.900
Good ratings: >=5.000000
Data set: 1000 x 100 rating matrix of class 'realRatingMatrix' with 72358 ratings.

We can now train a user-based model:

ubcfTrain = Recommender(getData(e, "train"), "UBCF")

And an item-based model:

ibcfTrain = Recommender(getData(e, "train"), "IBCF")

Now we can make our predictions:

ubcfPredict = predict(ubcfTrain, getData(e, "known"), type="ratings")

ibcfPredict = predict(ibcfTrain, getData(e, "known"), type="ratings")

You can explore these just like we did earlier.

Finally, we can take a look at our errors:

rbind(UBCF = calcPredictionAccuracy(ubcfPredict, getData(e, "unknown")), 
      IBCF = calcPredictionAccuracy(ibcfPredict, getData(e, "unknown")))
         RMSE      MSE      MAE
UBCF 4.705376 22.14056 3.697923
IBCF 5.343162 28.54938 4.141209

Since we want as little error as possible, we can conclude that the user-based collaborative filter performed better on root mean squared error, mean squared error, and mean absolute error.
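If you want to see exactly what those three measures are doing, they are easy to compute by hand. Here is a base R sketch on made-up predicted and actual ratings:

```r
# The three error measures from the table, computed by hand
# on made-up predicted/actual ratings.
actual = c(3, -2, 5, 0, 4)
predicted = c(2.5, -1, 4, 1, 5)

errors = predicted - actual

MSE = mean(errors^2)        # mean squared error
RMSE = sqrt(MSE)            # root mean squared error (back on the rating scale)
MAE = mean(abs(errors))     # mean absolute error

c(RMSE = RMSE, MSE = MSE, MAE = MAE)
```

Note that RMSE punishes large misses more heavily than MAE does, which is why the two can rank models differently.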

Sentiment Analysis

Words As Data

Words are everywhere. Believe it or not, you are reading words right now! Given our penchant for taking things and making numbers out of them, you are probably already guessing that we can somehow make words tell a story with numbers. If that is what you are guessing, then you are absolutely correct.

Processing Text

Before we can even begin to dive into analyzing text, we must first process the text. Processing text involves several steps that will be combined in various ways, depending on what we are trying to accomplish.

Stemming

Tense aside, are jumped, jump, and jumping the same thing? Yes, but what if we compare the actual strings? As strings, are they the same? No. We have strings of 6, 4, and 7 characters, respectively.

What if we remove the suffixes “ed” and “ing”? We are left with three instances of “jump”. Now we have something that is equivalent in meaning and in a string sense. This is the goal of stemming.

Let’s take a look to see how this works (you will need to install tm and SnowballC first):

jumpingStrings = c("jump", "jumping", "jumped", "jumper")

tm::stemDocument(jumpingStrings)
[1] "jump"   "jump"   "jump"   "jumper"

We got exactly what we expected, right? You might have noticed that “jumper” did not get stemmed. Do you have any idea why? Let’s think through it together. “Jump”, “jumping”, and “jumped” are all verbs related to the act of jumping. “Jumper”, on the other hand, is a person who jumps – it is a noun. Martin Porter’s stemming algorithm works incredibly well!

Hopefully, this makes conceptual sense; however, we also need to understand why we need to do it. In a great many text-based methods, we are going to create a matrix that keeps track of every term (i.e., word) in every document – this is known as a document-term matrix. If we know that “jump”, “jumping”, and “jumped” all refer to the same thing, we want it represented just once within our document-term matrix.

Shall we take a look?

library(tm)

documents = c("I like to jump.", 
              "I have jumped my whole life.", 
              "Jumping is in my blood.", 
              "I am a jumper.")

documentsCorp = tm::SimpleCorpus(VectorSource(documents))

documentsDTM = DocumentTermMatrix(documentsCorp)

inspect(documentsDTM)
<<DocumentTermMatrix (documents: 4, terms: 9)>>
Non-/sparse entries: 9/27
Sparsity           : 75%
Maximal term length: 7
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs blood have jump jumped jumper jumping life like whole
   1     0    0    1      0      0       0    0    1     0
   2     0    1    0      1      0       0    1    0     1
   3     1    0    0      0      0       1    0    0     0
   4     0    0    0      0      1       0    0    0     0

We can see that without stemming, we have 9 terms (short words like “I”, “a”, and “to” get dropped automatically, because the default minimum term length is three characters). Let’s do some stemming now:

documentsStemmed = stemDocument(documents)

documentsStemmed
[1] "I like to jump."            "I have jump my whole life."
[3] "Jump is in my blood."       "I am a jumper."            

And now the document-term matrix:

stemmedDocCorp = tm::SimpleCorpus(VectorSource(documentsStemmed))

stemmedDocDTM = DocumentTermMatrix(stemmedDocCorp)

inspect(stemmedDocDTM)
<<DocumentTermMatrix (documents: 4, terms: 7)>>
Non-/sparse entries: 9/19
Sparsity           : 68%
Maximal term length: 6
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs blood have jump jumper life like whole
   1     0    0    1      0    0    1     0
   2     0    1    1      0    1    0     1
   3     1    0    1      0    0    0     0
   4     0    0    0      1    0    0     0

If we are trying to find documents that are covering similar content or talking about similar things, this document-term matrix will help to draw better conclusions, because it is clear that the first three documents are talking about the act of jumping and this document-term matrix reflects that.
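We can check that claim directly. Rebuilding the stemmed document-term matrix above as a plain matrix, a quick cosine similarity (a standard way to compare documents) shows the jumping documents clustering together:

```r
# The stemmed document-term matrix from above, rebuilt as a plain
# matrix (columns: blood have jump jumper life like whole) so we can
# compare documents with cosine similarity.
dtm = matrix(c(0, 0, 1, 0, 0, 1, 0,
               0, 1, 1, 0, 1, 0, 1,
               1, 0, 1, 0, 0, 0, 0,
               0, 0, 0, 1, 0, 0, 0), nrow = 4, byrow = TRUE)

cosine = function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Documents 1-3 share the stemmed term "jump"; document 4 shares nothing.
cosine(dtm[1, ], dtm[3, ])    # about 0.5
cosine(dtm[1, ], dtm[4, ])    # exactly 0
```

Without stemming, documents 1 through 3 would share no terms at all and every pairwise similarity would be zero.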

This is going to be especially important for next week’s topic: topic models.

Stop Words

Some words do us very little good: articles, prepositions, and other very high frequency words. These are all words that need to be removed. Fortunately, you don’t have to do this on your own – a great many dictionaries exist that contain words ready for removal.

tm::stopwords("en")
  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our"        "ours"       "ourselves"  "you"        "your"      
 [11] "yours"      "yourself"   "yourselves" "he"         "him"       
 [16] "his"        "himself"    "she"        "her"        "hers"      
 [21] "herself"    "it"         "its"        "itself"     "they"      
 [26] "them"       "their"      "theirs"     "themselves" "what"      
 [31] "which"      "who"        "whom"       "this"       "that"      
 [36] "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"     
 [46] "have"       "has"        "had"        "having"     "do"        
 [51] "does"       "did"        "doing"      "would"      "should"    
 [56] "could"      "ought"      "i'm"        "you're"     "he's"      
 [61] "she's"      "it's"       "we're"      "they're"    "i've"      
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
[101] "who's"      "what's"     "here's"     "there's"    "when's"    
[106] "where's"    "why's"      "how's"      "a"          "an"        
[111] "the"        "and"        "but"        "if"         "or"        
[116] "because"    "as"         "until"      "while"      "of"        
[121] "at"         "by"         "for"        "with"       "about"     
[126] "against"    "between"    "into"       "through"    "during"    
[131] "before"     "after"      "above"      "below"      "to"        
[136] "from"       "up"         "down"       "in"         "out"       
[141] "on"         "off"        "over"       "under"      "again"     
[146] "further"    "then"       "once"       "here"       "there"     
[151] "when"       "where"      "why"        "how"        "all"       
[156] "any"        "both"       "each"       "few"        "more"      
[161] "most"       "other"      "some"       "such"       "no"        
[166] "nor"        "not"        "only"       "own"        "same"      
[171] "so"         "than"       "too"        "very"      

Removing stopwords takes little effort!

documents = c("I like to jump.", 
              "I have jumped my whole life.", 
              "Jumping is in my blood.", 
              "I am a jumper.")

tm::removeWords(documents, words = tm::stopwords("en"))
[1] "I like  jump."          "I  jumped  whole life."
[3] "Jumping    blood."      "I   jumper."           

We can even include custom stopwords:

tm::removeWords(documents, words = c("blood", tm::stopwords("en")))
[1] "I like  jump."          "I  jumped  whole life."
[3] "Jumping    ."           "I   jumper."           

Regular Expressions

Stemming and stopword removal will go a long way toward helping us process and clean our text. There are times, however, when you need complete control over what you remove. You might need to strip punctuation and other symbols, along with the wide array of oddities that come up in typical analyses (emoticons and wacky encoding characters!). When those cases pop up, you might need to make use of regular expressions. You should have already learned about regular expressions, but be prepared to circle back to them constantly when engaging in any manner of data science shenanigans!

Many text processors will deal with hyphenated words. If, however, your data is coming from the web, then what looks like a hyphen might not really be a hyphen!

exampleStrings = c("This is a regularly-occuring hyphen", 
                   "This is something that more closely--resembles a manually-produced em dash", 
                   "This is something that came from the darkest recesses of ill—conceived web design")

We can turn to our regular expression to see what we can do:

gsub(pattern = "-", replacement = " ", exampleStrings)
[1] "This is a regularly occuring hyphen"                                              
[2] "This is something that more closely  resembles a manually produced em dash"       
[3] "This is something that came from the darkest recesses of ill—conceived web design"

Using gsub and just specifying a regular “-” will handle just about everything, except for whatever is in the last sentence. What is it? It looks like a hyphen, but it is not. To deal with it, we will need to get creative:

gsub(pattern = "(?!\\w)\\S", " ", exampleStrings, perl = TRUE)
[1] "This is a regularly occuring hyphen"                                              
[2] "This is something that more closely  resembles a manually produced em dash"       
[3] "This is something that came from the darkest recesses of ill conceived web design"

By using some pattern matching, we are able to get out everything between words, no matter what it might be. This particular bit of regex uses what is known as a negative lookahead. We have our main expression, the non-whitespace character, \S. In front of it, we place the lookahead (the chunk in parentheses that starts with “?!”), which asserts that the character at the current position is not a word character. Reading it together, we would say, “match any character that is not whitespace and also not a word character.”
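If the negative lookahead feels too clever, a more conservative alternative (a sketch, assuming the only troublemakers are hyphen-like characters) is to list the Unicode dashes explicitly in a character class:

```r
# The strings from above; the third one contains an em dash (\u2014),
# which is what made it survive the plain "-" substitution
exampleStrings = c("This is a regularly-occuring hyphen",
                   "This is something that more closely--resembles a manually-produced em dash",
                   "This is something that came from the darkest recesses of ill\u2014conceived web design")

# Replace the ASCII hyphen, the en dash (\u2013), and the em dash (\u2014)
cleaned = gsub("[-\u2013\u2014]", " ", exampleStrings)

cleaned[3]
```

This version only touches characters you have named, so nothing unexpected gets swept up – the tradeoff is that you must know which dash variants lurk in your data.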

While this is certainly a contrived example and would need additional robustness testing, it goes to show you that working with interesting data will yield interesting problems. I have been telling you all along that nothing is ever as it seems.

Text Processing Tools

There are several R packages that will help us process text. The tm package is popular and automates most of our work. You already saw how we used the stemming and stopword removal functions, but tm is full of fun stuff and allows for one-pass text processing.

library(tm)

documents = c("I like to jump.", 
              "I have jumped my whole life.", 
              "Jumping is in my blood.", 
              "I am a jumper.")

documentCorp = SimpleCorpus(VectorSource(documents))

stopWordRemoval = function(x) {
  removeWords(x, stopwords("en"))
}

textPrepFunctions = list(removePunctuation,
                         stemDocument,
                         stopWordRemoval,
                         removeNumbers,
                         stripWhitespace)

tm_map(documentCorp, FUN = tm_reduce, tmFuns = textPrepFunctions)

Once you get your text tidied up (or even before), you can produce some visualizations!

library(dplyr)

library(tidytext)

library(wordcloud2)

txExecutions = read.csv("txEx.csv", stringsAsFactors = FALSE)

txExecutions %>% 
  dplyr::select(correctedStatements, inmateNumber) %>% 
  unnest_tokens(word, correctedStatements) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(n > 25) %>% 
  na.omit() %>% 
  wordcloud2(shape = "cardioid")

Sentiment Analysis

Sentiment analysis is commonly used when we want to know the general feelings of what someone has written or said. Sentiment analysis is commonly seen applied to Twitter and other social media posts, but we can use it anywhere where people have written/said something (product reviews, song lyrics, final statements).

Sentiment can take many different forms: positive/negative affect, emotional states, and even financial contexts.

Let’s take a peek at some simple sentiment analysis.

Simple Sentiment

Let’s consider the following statement:

library(tidytext)

statement = "I dislike programming, but I really love R."

tokens = tibble::tibble(text = statement) %>% 
  unnest_tokens(tbl = ., output = word, input = text)

tokens
# A tibble: 8 x 1
  word       
  <chr>      
1 i          
2 dislike    
3 programming
4 but        
5 i          
6 really     
7 love       
8 r          

From there, we get every individual token. This brings us to a quick, but necessary, detour. When we are talking about text, tokens are simply elements that have some general meaning. We typically associate tokens with individual words, but we could go deeper than that (spaces, punctuation, n-grams). While we won’t dive into them too deeply, n-grams are also interesting. Just like our typical notion of n, an n-gram is an n-length group of words. We can set n to be anything, but we would typically look at 2-gram and 3-gram chunks.

tokenizers::tokenize_ngrams(statement, n = 2)
[[1]]
[1] "i dislike"           "dislike programming" "programming but"    
[4] "but i"               "i really"            "really love"        
[7] "love r"             

These are helpful for looking at frequently occurring combinations of words. They are also going to be helpful for us in just a bit.
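To demystify what tokenize_ngrams is doing, here is a minimal base-R sketch of bigram construction (the tokenizers package handles many edge cases that this toy version ignores):

```r
# Strip punctuation, lowercase, split on whitespace, then pair each
# word with its successor to form bigrams
bigrams = function(text) {
  words = tolower(unlist(strsplit(gsub("[[:punct:]]", "", text), "\\s+")))
  paste(head(words, -1), tail(words, -1))
}

bigrams("I dislike programming, but I really love R.")
```

An 8-word sentence yields 7 bigrams, matching the tokenize_ngrams output above.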

Now back to the matter at hand.

Now, we can compare the tokens within our statement to some pre-defined dictionary of positive and negative words.

library(tidyr)

tokens %>%
  inner_join(get_sentiments("bing")) %>% 
  count(sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)
# A tibble: 1 x 3
  negative positive sentiment
     <dbl>    <dbl>     <dbl>
1        1        1         0

When we use Bing’s dictionary, we see that we get one positive word (love) and one negative word (dislike), for a neutral overall sentiment (a sentiment of 0 indicates neutrality, while anything above 0 indicates an increasing amount of positivity and anything below 0 an increasing amount of negativity).
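Under the hood, this style of sentiment scoring is just keyword matching against two word lists. A minimal base-R sketch with a tiny made-up dictionary (the real Bing lexicon has thousands of entries):

```r
# Hypothetical mini-dictionary for illustration only
posWords = c("love", "like", "happy")
negWords = c("dislike", "hate", "sad")

# Tokenize on non-word characters, then count dictionary hits
simpleSentiment = function(text) {
  tokens = tolower(unlist(strsplit(text, "[^[:alnum:]']+")))
  sum(tokens %in% posWords) - sum(tokens %in% negWords)
}

simpleSentiment("I dislike programming, but I really love R.")  # 1 positive - 1 negative = 0
```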

Do you think that dislike and love are of the same magnitude? If I had to make a wild guess, I might say that love is stronger than dislike. Let’s switch out our sentiment library for something with a better notion of polarity magnitude.

tokens %>%
  inner_join(get_sentiments("afinn"))
# A tibble: 2 x 2
  word    score
  <chr>   <int>
1 dislike    -2
2 love        3

Now this looks a bit more interesting! “Love” has a stronger positive polarity than “dislike” has negative polarity. So, we could guess that we would have some positive sentiment.

If we divide the sum of our word sentiments by the number of words matched by the dictionary, we get an idea of our sentence’s overall sentiment.

tokens %>%
  inner_join(get_sentiments("afinn")) %>% 
  summarize(n = nrow(.), sentSum = sum(score)) %>% 
  mutate(sentiment = sentSum / n)
# A tibble: 1 x 3
      n sentSum sentiment
  <int>   <int>     <dbl>
1     2       1       0.5

Our sentiment of .5 tells us that our sentence is positive, even if only slightly so.

While these simple sentiment analyses provide decent measures of the sentiment of our text, we are ignoring big chunks of our text by just counting keywords.

For example, it is probably fair to say that “really love” is stronger than just “love”. We might want to switch over to some techniques that consider n-grams and other text features to calculate sentiment.

Smarter Sentiment Analysis

When we use sentiment analysis that is aware of context, valence (“love” is stronger than “like”), modifiers (e.g., “really love”), and adversative statements (“but,…”, “however,…”), we get a better idea about the real sentiment of the text.
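As a toy illustration of how amplifiers and adversatives change a score (made-up weights and a deliberately simplistic pass over the tokens – sentimentr’s actual algorithm is considerably more sophisticated):

```r
# Hypothetical weights for illustration only
polarity   = c(dislike = -1, love = 1)
amplifiers = c(really = 1.8)   # multiply the next polarized word
adversativeDownweight = 0.5    # shrink everything before a "but"

toyScore = function(words) {
  words = tolower(words)
  score = 0
  amp = 1
  for (w in words) {
    if (w == "but") {
      score = score * adversativeDownweight   # discount the earlier clause
    } else if (w %in% names(amplifiers)) {
      amp = amplifiers[[w]]                   # remember the amplifier
    } else if (w %in% names(polarity)) {
      score = score + amp * polarity[[w]]     # apply (possibly amplified) polarity
      amp = 1
    }
  }
  score
}

toyScore(c("i", "dislike", "programming", "but", "i", "really", "love", "r"))  # 1.3
```

The “dislike” contributes -1, the “but” halves that, and “really love” adds 1.8, leaving a clearly positive total – the same qualitative story sentimentr tells below.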

We will use the jockers sentiment library, but many more are available. Depending on your exact needs, there are dictionaries designed for different applications. A prime example, and one that should be near to our hearts, is the Loughran and McDonald dictionary for financial documents (e.g., SEC 10-K filings). Tim Loughran and Bill McDonald are superstars in the Finance Department at Notre Dame! There are dictionaries designed to measure certain attitudes and opinions (e.g., disgust, excitedness, sadness) and even dictionaries to measure emoji sentiment.

Before we engage in our whole sentiment analysis, let’s take a look at a few things.

Here is the dictionary that jockers will use.

lexicon::hash_sentiment_jockers
                 x     y
    1:     abandon -0.75
    2:   abandoned -0.50
    3:   abandoner -0.25
    4: abandonment -0.25
    5:    abandons -1.00
   ---                  
10734:     zealous  0.40
10735:      zenith  0.40
10736:        zest  0.50
10737:      zombie -0.25
10738:     zombies -0.25

You might want to use View() to get a complete look at what is happening in there.

We should also take a peek at our valence shifters:

lexicon::hash_valence_shifters
              x y
  1: absolutely 2
  2:      acute 2
  3:    acutely 2
  4:      ain't 1
  5:       aint 1
 ---             
136:    whereas 4
137:      won't 1
138:       wont 1
139:   wouldn't 1
140:    wouldnt 1

With all of that out of the way, let’s get down to the matter at hand:

library(sentimentr); library(lexicon); library(magrittr)

statement = "I dislike programming, but I really love R."

sentiment(statement, polarity_dt = lexicon::hash_sentiment_jockers)
   element_id sentence_id word_count sentiment
1:          1           1          8  0.516188

We can see that we get a much stronger sentiment score when we include more information from the sentence. While the first clause starts with a negative word (“dislike” carries a negative weight in our dictionary), the adversative “but” down-weights whatever is in the initial phrase, and then we have the amplified (from “really”) positive sentiment of “love.”

With all of this together, we get a much better idea about the sentiment of our text.

Example Text

While the text that we have used so far serves its purpose as an example quite well, we can always take a look at other written words.

Let’s consider the following data:

txExecutionsShort = txExecutions %>% 
  dplyr::select(inmateNumber, correctedStatements) %>% 
  filter(inmateNumber == 529|
           inmateNumber == 843|
           inmateNumber == 714|
           inmateNumber == 999253)


sentiment(get_sentences(txExecutionsShort), 
          polarity_dt = lexicon::hash_sentiment_jockers) %>% 
  group_by(inmateNumber) %>% 
  summarize(meanSentiment = mean(sentiment))
# A tibble: 4 x 2
  inmateNumber meanSentiment
         <int>         <dbl>
1          529        0.0972
2          714        0.0763
3          843        0.561 
4       999253        0.0121

Other Text Fun

Sentiment analysis is always a handy tool to have around. You might also want to explore other descriptive aspects of your text.

The koRpus package allows for all sorts of interesting descriptives. There are a great number of readability and lexical diversity statistics (Fucks is likely my favorite).

We need to tokenize our text in a manner that will please koRpus.

library(koRpus)

statementTokens = lapply(txExecutionsShort$correctedStatements, function(x) tokenize(x, 
                           format = "obj", lang = "en"))

statementTokens
[[1]]
        token      tag lemma lttr       wclass
1        What word.kRp          4         word
2          is word.kRp          2         word
3       about word.kRp          5         word
4          to word.kRp          2         word
5   transpire word.kRp          9         word
6          in word.kRp          2         word
                                         [...]
172      well word.kRp          4         word
173        by word.kRp          2         word
174       all word.kRp          3         word
175    T.D.C. abbr.kRp          6 abbreviation
176 personnel word.kRp          9         word
177         .     .kRp          1     fullstop
                                          desc stop stem
1                          Word (kRp internal) <NA> <NA>
2                          Word (kRp internal) <NA> <NA>
3                          Word (kRp internal) <NA> <NA>
4                          Word (kRp internal) <NA> <NA>
5                          Word (kRp internal) <NA> <NA>
6                          Word (kRp internal) <NA> <NA>
                                                        
172                        Word (kRp internal) <NA> <NA>
173                        Word (kRp internal) <NA> <NA>
174                        Word (kRp internal) <NA> <NA>
175                Abbreviation (kRp internal) <NA> <NA>
176                        Word (kRp internal) <NA> <NA>
177 Sentence ending punctuation (kRp internal) <NA> <NA>

[[2]]
     token      tag lemma lttr      wclass
1        I word.kRp          1        word
2        '    ''kRp          1 punctuation
3        m word.kRp          1        word
4    sorry word.kRp          5        word
5      for word.kRp          3        word
6     what word.kRp          4        word
                                     [...]
14    this word.kRp          4        word
15       .     .kRp          1    fullstop
16   Jesus word.kRp          5        word
17 forgive word.kRp          7        word
18      me word.kRp          2        word
19       .     .kRp          1    fullstop
                                         desc stop stem
1                         Word (kRp internal) <NA> <NA>
2                        Quote (kRp internal) <NA> <NA>
3                         Word (kRp internal) <NA> <NA>
4                         Word (kRp internal) <NA> <NA>
5                         Word (kRp internal) <NA> <NA>
6                         Word (kRp internal) <NA> <NA>
                                                       
14                        Word (kRp internal) <NA> <NA>
15 Sentence ending punctuation (kRp internal) <NA> <NA>
16                        Word (kRp internal) <NA> <NA>
17                        Word (kRp internal) <NA> <NA>
18                        Word (kRp internal) <NA> <NA>
19 Sentence ending punctuation (kRp internal) <NA> <NA>

[[3]]
   token      tag lemma lttr   wclass
1      I word.kRp          1     word
2     am word.kRp          2     word
3     so word.kRp          2     word
4   glad word.kRp          4     word
5      I word.kRp          1     word
6  found word.kRp          5     word
                                [...]
10    am word.kRp          2     word
11    so word.kRp          2     word
12 happy word.kRp          5     word
13   for word.kRp          3     word
14    it word.kRp          2     word
15     .     .kRp          1 fullstop
                                         desc stop stem
1                         Word (kRp internal) <NA> <NA>
2                         Word (kRp internal) <NA> <NA>
3                         Word (kRp internal) <NA> <NA>
4                         Word (kRp internal) <NA> <NA>
5                         Word (kRp internal) <NA> <NA>
6                         Word (kRp internal) <NA> <NA>
                                                       
10                        Word (kRp internal) <NA> <NA>
11                        Word (kRp internal) <NA> <NA>
12                        Word (kRp internal) <NA> <NA>
13                        Word (kRp internal) <NA> <NA>
14                        Word (kRp internal) <NA> <NA>
15 Sentence ending punctuation (kRp internal) <NA> <NA>

[[4]]
       token      tag lemma lttr      wclass
1        Yes word.kRp          3        word
2          I word.kRp          1        word
3         do word.kRp          2        word
4          .     .kRp          1    fullstop
5      There word.kRp          5        word
6        has word.kRp          3        word
                                       [...]
389        I word.kRp          1        word
390        '    ''kRp          1 punctuation
391        m word.kRp          1        word
392 finished word.kRp          8        word
393  talking word.kRp          7        word
394        .     .kRp          1    fullstop
                                          desc stop stem
1                          Word (kRp internal) <NA> <NA>
2                          Word (kRp internal) <NA> <NA>
3                          Word (kRp internal) <NA> <NA>
4   Sentence ending punctuation (kRp internal) <NA> <NA>
5                          Word (kRp internal) <NA> <NA>
6                          Word (kRp internal) <NA> <NA>
                                                        
389                        Word (kRp internal) <NA> <NA>
390                       Quote (kRp internal) <NA> <NA>
391                        Word (kRp internal) <NA> <NA>
392                        Word (kRp internal) <NA> <NA>
393                        Word (kRp internal) <NA> <NA>
394 Sentence ending punctuation (kRp internal) <NA> <NA>

lapply(statementTokens, function(x) readability(x, index = "Flesch.Kincaid", quiet = TRUE))
[[1]]

Flesch-Kincaid Grade Level
  Parameters: default 
       Grade: 5.83 
         Age: 10.83 

Text language: en 

[[2]]

Flesch-Kincaid Grade Level
  Parameters: default 
       Grade: 0.56 
         Age: 5.56 

Text language: en 

[[3]]

Flesch-Kincaid Grade Level
  Parameters: default 
       Grade: 2.51 
         Age: 7.51 

Text language: en 

[[4]]

Flesch-Kincaid Grade Level
  Parameters: default 
       Grade: 1.45 
         Age: 6.45 

Text language: en 

We already know that text is everywhere, and our sentiment analysis was a reasonable crack at extracting some meaning from it. A common line of inquiry relates to what is expressed within the text. In the traditional social sciences, this was done through some form of content analysis – a coding scheme was developed by researchers, people read each comment and assigned it a code, some agreement statistics were computed, and any discrepancies were hashed out by the researchers. When this was the only option, it was serviceable. Could you do this for thousands of tweets? With time, maybe. Could you do this with a few million articles? No. This is where topic models, and latent Dirichlet allocation, come to save the day.

What follows borrows heavily from documents produced by Michael Clark and by a collaboration between Michael Clark and Seth Berry.

Latent Dirichlet Allocation

We will start with a brief demonstration of a standard latent Dirichlet allocation (LDA) for topic modeling. The basic idea is to first generate some documents based on the underlying model, and then we’ll use the topicmodels package to recover the topics via LDA.

Suffice it to say, one can approach this in (at least) one of two ways. In one sense, LDA is a dimension reduction technique, much like the family of techniques that includes PCA, factor analysis, non-negative matrix factorization, etc. We will take a whole lot of terms, loosely defined, and boil them down to a few topics. In this sense, LDA is akin to discrete PCA. Another way to think about this is more from the perspective of factor analysis, where we are keenly interested in interpretation of the result, and want to know both what terms are associated with which topics, and what documents are more likely to present which topics. The following is the plate diagram and description for standard LDA from Blei, Ng, and Jordan (2003).

  • \(\alpha\) is the parameter of the Dirichlet prior on the per-document topic distributions
  • \(\eta\) is the parameter of the Dirichlet prior on the per-topic word distribution
  • \(\theta_m\) is the topic distribution for document m
  • \(\beta_k\) is the word distribution for topic k
  • \(z_{mn}\) is the topic for the n-th word in document m
  • \(w_{mn}\) is the specific word

Both z and w are from a multinomial draw based on the θ and β distributions respectively. The key idea is that, to produce a given document, one draws a topic, and given the topic, words are drawn from it.
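That generative story can be sketched in a few lines of R, with made-up θ and β values and a toy three-word vocabulary (nothing here is estimated from data):

```r
set.seed(1)

theta = c(A = .60, B = .25, C = .15)        # per-document topic proportions
B_toy = rbind(A = c(.7, .2, .1),            # per-topic word distributions
              B = c(.1, .8, .1),
              C = c(.1, .1, .8))
vocab = c("ball", "game", "weather")

topic = sample(names(theta), size = 1, prob = theta)   # z ~ Multinomial(theta)
word  = sample(vocab, size = 1, prob = B_toy[topic, ]) # w ~ Multinomial(beta_z)

c(topic = topic, word = word)
```

Repeating the two draws many times per document is exactly how we generate the simulated corpus below.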

Generating Documents

In the standard setting, to be able to conduct such an analysis from text, one needs a document-term matrix, where rows represent documents, and columns terms. Terms are typically words but could be any n-gram of interest. In practice, this is where you’ll spend most of your time, as text is never ready for analysis, and must be scraped, converted, stemmed, cleaned etc. We will initially create θ and β noted above, then given those, draw topics and words given topics based on the multinomial distribution.

Topic Probabilities

We begin the simulation by creating topic probabilities. There will be k = 3 topics. Half of our documents will have topic probabilities (\(\theta_1\)) which will be notably different from the other half (\(\theta_2\)). Specifically, the first half will show higher probability of topics “A” and “B”, while the second half of documents show higher probability of topic “C”. What we’ll end up with here is an m x k matrix of probabilities θ where each of the m documents has a non-zero probability for each of the k topics. Note that in the plate diagram, these would come from a Dirichlet(α) draw rather than be fixed like this, but hopefully this will make things clear starting out.

library(tidyverse)

nDocs = 500                                       # Number of documents

wordsPerDoc = rpois(nDocs, 100)                   # Total words/terms in a document

thetaList = list(c(A = .60, B = .25, C = .15),    # Topic proportions for first and second half of data 
                 c(A = .10, B = .10, C = .80))    # These values represent a Dir(alpha) draw

theta_1 = t(replicate(nDocs / 2, thetaList[[1]]))

theta_2 = t(replicate(nDocs / 2, thetaList[[2]]))

theta = rbind(theta_1, theta_2)      

Topic Assignments and Labels

With topic probabilities in hand, we’ll draw topic assignments from a categorical distribution.

firsthalf = 1:(nDocs / 2)
secondhalf = (nDocs / 2 + 1):nDocs

Z = apply(theta, 1, rmultinom, n = 1, size = 1)   # draw topic assignment

rowMeans(Z[, firsthalf])                           # roughly equal to theta_1
[1] 0.596 0.256 0.148

Now, we can see which is the most likely topic.

z = apply(Z, 2, which.max) 

We have our list of documents and each document’s topic assignment.

Topics

Next we need the topics themselves. Topics are probability distributions of terms, and in what follows we’ll use the Dirichlet distribution to provide the prior probabilities for the terms. With topic A, we’ll make the first ~40% of terms have a higher probability of occurring, the last ~40% go with topic C, and the middle more associated with topic B. To give a sense of the alpha settings, alpha = c(8, 1, 1) would result in topic probabilities of .8, .1, .1, as would alpha = c(80, 10, 10). We’ll use the gtools package for the rdirichlet function.
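We can sanity-check that claim about alpha without any packages: a Dirichlet draw is just independent gamma draws normalized to sum to one, so the component means of many draws with alpha = c(8, 1, 1) should sit near .8, .1, .1.

```r
set.seed(42)

# One Dirichlet(alpha) draw via normalized gamma draws
rdirichletOne = function(alpha) {
  g = rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

draws = t(replicate(10000, rdirichletOne(c(8, 1, 1))))

round(colMeans(draws), 2)   # close to c(.8, .1, .1)
```

This is the same quantity gtools::rdirichlet computes; we only hand-roll it here to make the mechanics visible.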

nTerms = max(wordsPerDoc)

breaks = quantile(1:nTerms, c(.4,.6,1)) %>% round()

cuts = list(1:breaks[1], (breaks[1] + 1):breaks[2], 
            (breaks[2] + 1):nTerms)

library(gtools)

B_k = matrix(0, ncol = 3, nrow = nTerms)

B_k[,1] = rdirichlet(n=1, alpha=c(rep(10, length(cuts[[1]])),    # topics for 1st 40% of terms
                                  rep(1,  length(cuts[[2]])),
                                  rep(1,  length(cuts[[3]]))))

B_k[,2] = rdirichlet(n=1, alpha=c(rep(1,  length(cuts[[1]])),    # topics for middle 20%
                                  rep(10, length(cuts[[2]])),
                                  rep(1,  length(cuts[[3]]))))

B_k[,3] = rdirichlet(n=1, alpha=c(rep(1,  length(cuts[[1]])),    # topics for last 40%
                                  rep(1,  length(cuts[[2]])),
                                  rep(10, length(cuts[[3]]))))

Here is a visualization of the term-topic matrix, where the dark represents terms that are notably less likely to be associated with a particular topic.

library(ggplot2)

as.data.frame(B_k) %>%
  mutate(term = 1:nrow(.)) %>% 
  tidyr::gather(key = topic, value = prob, -term) %>% 
  ggplot(., aes(topic, term, color = prob)) +
  geom_tile()

Remember how we specified this – we assigned 40% to topic 1, 40% to topic 3, and left the middle 20% to topic 2. This visualization clearly shows this.

Now, given the topic assignment, we draw words for each document according to its size via a multinomial draw, and with that, we have our document-term matrix. However, we can also think of each document as merely a bag of words, where order, grammar etc. is ignored, but the frequency of term usage is retained.

wordlist_1 = sapply(1:nDocs, 
                    function(i) t(rmultinom(1, size = wordsPerDoc[i], prob = B_k[, z[i]])), 
                    simplify = FALSE)  

# smash to doc-term matrix
dtm_1 = do.call(rbind, wordlist_1)

colnames(dtm_1) = paste0('word', 1:nTerms)

# bag of words representation
wordlist_1 = lapply(wordlist_1, function(wds) rep(paste0('word', 1:length(wds)), wds))

If you print out the wordlist_1 object, you will see the words associated with each document.

Topic Models

Now with some theory under our belt, we can take a look at analyzing some data. Just to keep everything light, we will be looking at the last statements of inmates executed in Texas.

Just like our sentiment analysis, there is a fair chunk of cleaning to do:

library(stm)

executions = read.csv("txEx.csv", stringsAsFactors = FALSE)

executionsText = textProcessor(documents = executions$correctedStatements, 
                               metadata = executions)
Building corpus... 
Converting to Lower Case... 
Removing punctuation... 
Removing stopwords... 
Removing numbers... 
Stemming... 
Creating Output... 
executionsPrep = prepDocuments(documents = executionsText$documents, 
                               vocab = executionsText$vocab,
                               meta = executionsText$meta)
Removing 1240 of 2305 terms (1240 of 12857 tokens) due to frequency 
Your corpus now has 425 documents, 1065 terms and 11617 tokens.

The stm package has some pretty nice facilities for determining a number of topics:

kTest = searchK(documents = executionsPrep$documents, 
             vocab = executionsPrep$vocab, 
             K = c(3, 4, 5, 10, 20), verbose = FALSE)

plot(kTest)

The four plots that are returned are meant to help us determine the best number of topics. I like to focus on semantic coherence (how well the words hang together) and the residuals; we want low residuals and high semantic coherence. The residuals definitely take a sharp dive as we increase K. If we consider one of the major assumptions of LDA (i.e., that every topic that has ever existed, existed before writing), we could almost always guess that we would need a great number of topics. Our coherence seems to do pretty well at some of the lower numbers (not entirely surprising). Weighing these together, we can settle on 5 topics for our subsequent analyses.

It is worthwhile to note that we can also do model selection with the stm package, but that is some work that will be best done if you want some additional playtime.

With our 5 topics, we can start our actual model:

topics5 = stm(documents = executionsPrep$documents, 
             vocab = executionsPrep$vocab, 
             K = 5, verbose = FALSE)

We can get a lot of output from our model, but we can focus on the expected topic proportions plot:

plot(topics5)

We are essentially looking at the proportion of each topic, with a few of the highest probability words.

This is great, but it is really fun to see what emerges from the topics.

labelTopics(topics5)
Topic 1 Top Words:
     Highest Prob: love, thank, life, support, peopl, way, brother 
     FREX: live, allah, appreci, thank, bye, row, learn 
     Lift: cousin, maria, mauri, open, pam, within, afford 
     Score: thank, love, bye, allah, support, appreci, live 
Topic 2 Top Words:
     Highest Prob: know, love, want, yall, say, sorri, dont 
     FREX: alright, watch, dont, worri, mom, tell, care 
     Lift: alba, ass, daddi, fall, fault, folk, forc 
     Score: yall, dont, love, kid, sorri, tell, care 
Topic 3 Top Words:
     Highest Prob: will, kill, stay, peopl, innoc, murder, get 
     FREX: kill, innoc, murder, must, stop, justic, fight 
     Lift: adam, bed, church, dna, error, forgotten, heard 
     Score: kill, black, polic, innoc, must, evid, america 
Topic 4 Top Words:
     Highest Prob: famili, like, hope, peac, god, one, friend 
     FREX: receiv, famili, friend, peac, victim, like, mother 
     Lift: behalf, behold, final, grate, knowledg, known, lay 
     Score: peac, like, famili, god, rejoic, unto, propheci 
Topic 5 Top Words:
     Highest Prob: forgiv, lord, will, god, sorri, jesus, ask 
     FREX: forgiv, sin, thi, holi, prais, amen, accept 
     Lift: accept, action, anointest, bread, comfort, etern, everlast 
     Score: forgiv, thi, holi, thou, lord, jesus, amen 

Let’s focus on the FREX words (they occur frequently within the topic and are exclusive to that topic) and the highest probability words (i.e., the words that have the highest probability of occurring within that topic). The Lift and Score words (just a few different ways of weighting occurrence) can be useful, but are a bit less intuitive than the other two. Let’s put some names to the topics. Topic 1 is likely expressing “thanks for the love and the support”, topic 2 is something along the lines of “everything is going to be alright”, topic 3 is “you are murdering an innocent man”, topic 4 is probably “I hope your family finds peace”, and topic 5 is “I hope God forgives me”.

We can look at statements that have a high probability of being associated with each topic:

findThoughts(topics5, texts = executionsPrep$meta$correctedStatements, n = 1)

 Topic 1: 
     I would like to extend my love to my family members and my relatives for all of the love and support you have showed me. I extend my special love to my daughter who I love greatly. I hope that you forever remember me. I hope that you will always cherish the love and the strength that I have provided you. My love for you will remain with you winthin your heart and in part of your soul. As to all my brothers I love you all with all of my heart. But during your time of departure from this earth plane you will have to face the judgement of God for the lack of love you have shown my aunt and my cousins. We were never brought up to be that way. As you know our parents brought us up to love one another no matter what. There was no love showed to my aunt or none of my cousins. I can forgive you all but you must ask forgiveness from God for how you have hurt our aunt and our family. I leave now at this moment to join my parents and my only sister whose lives were not taken by me. To all the fellows on death row, I thank you for the love that you have shown me and for the strength that you provided me. You all keep your heads up. As for my attorney's I thank you all for being there for me. As defense attorneys you have shown me a lot strength. May my love touch each one of you all's souls as I leave this body. 
 Topic 2: 
     Hey, how y'all doing out there? I done lost my voice. Y'all be strong now, alright? Don, thanks man. I love you, Gloria, always baby. That's all I got to say. Hey, don't y'all worry about me, okay? 
 Topic 3: 
     I would like to say first of all the real violent crimes in this case are acts committed by James Boswell and Clay Morgan Gaines. We have the physical evidence to prove fabrication and cover-up. The people responsible for killing me will have blood on their hands for an unprovoked murder. I am not guilty; I acted in self-defense and reflex in the face of a police officer who was out of control. James Boswell had his head beat in; possibly due to this he had problems. My jurors had not heard about that. They did not know he had suffered a head injury from the beating by a crack dealer five months earlier; that he was filled with anger and wrote an angry letter to the Houston Chronicle. He expressed his frustration at the mayor, police chief and fire chief. He was mad at the world. Three and a half months before I worked on a deal with the DEA, the informant was let off. At the moment he left the courtroom, he became angry with me; Officer Boswell was upset about this. Officer Boswell and an angry woman were in the police car and they were talking in raised voices. In other words, Officer Boswell was angry at the time I walked up. Officer Boswell may have reacted to the... 
 Topic 4: 
     The following is the personal final statement of and by Michael L. McBride. The Beatitudes: Jesus lifted up his eyes on His disciples, and said, "Blessed be the poor: for yours is the kingdom of God. Blessed are ye that hunger now: for ye shall be filled. Blessed are ye that weep now: for ye shall laugh. Blessed are ye, when men shall hate you, and they shall separate you from their company, and shall reproach you, and cast out your name as evil for the Son of Man's sake. Rejoice ye in that day, and leap for joy: for behold, your reward is great in Heaven: for in the like manner did their fathers unto the prophets. But woe unto you that are rich! for ye have received your consolation. Woe unto you that are full! for ye shall hunger. Woe unto you that laugh now! for ye shall moan and weep. Woe unto you, when all men shall speak well of you! for so did their fathers to the false prophets. The supremacy of love over gifts: I Corinthians, Chapter 13: 4-8: Love is patient, love is kind, and is not jealous, love does not brag and is no arrogant, does not act unbecoming; it does not seek its own, is not provoked, does not take into account a wrong suffered, does not rejoice in unrighteousness, but rejoices with the truth; bears all things, believes all things, hopes all things, endures all things. Love never fails; but if there are gifts of prophecy, they will be done away; if there tongues, they will cease. Now abide faith, hope, love, these three: but the greatest of these is love. Poem: Do not stand at my grave and weep, I am not there I do not sleep. I am the diamond glints in the snow, I am the sunlight on the ripened grain. I am the gentle autumn rain. When you awaken in the morning's hush, I am the swift uplifting rush of quiet birds in circled flight, I am the soft stars that shine at night.  Do not stand at my grave and cry, I am not there. I did not die. Signed Michael L. McBride #903 May 11, 2000 Huntsville, Texas 
 Topic 5: 
     Mama Isabel told me to tell you hello. Holy, holy, holy! Lord God Almighty! Early in the morning our song shall rise to Thee; Holy, holy, holy, merciful and mighty! God in three Persons, blessed Trinity. Holy, holy, holy! Merciful and mighty. All Thy works shall praise Thy name, in earth, and sky, and sea; Holy, holy, holy, merciful and mighty! God in three Persons, blessed Trinity. Oh, our Father who art in heaven, holy, holy, holy be Thy name. Thy kingdom come, Thy will be done, on earth as it is in heaven. Give us this day our daily bread and forgive us our sin as we forgive our debtors. Lead us not into temptation, but deliver us from evil, for Thine is the kingdom and the power and the glory forever and ever. Now, Father, into Thy hands I commit my spirit. Amen.

We can see that the names we put to our topics make sense, given what we see in the actual texts.